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Retargeting of Existing FORTRAN Program 

and 

Development of Parallel Compilers 


SUMMARY 

This report describes the software models used in implementing paralleliz- 
ing compiler for the B-HIVE multiprocessor system. The various models and 
strategies used in our compiler development are: 

1) Flexible granularity model, which allows a compromise between two 
extreme granularity models; 

2) Communication Model, which is capable of precisely describing the 
interprocessor communication timings and patterns; 

3) Loop Type detection strategy, which identifies different types of loops, 
such as DOALL, DOACR, and DOSEQ; 

4) Critical Path with Coloring scheme, which is a versat ile; scheduling 
strategy for any multicomputer with some associated carnmunication 
costs; 

5) Loop Allocation Strategy, which realizes optimum me'lappcd opera- 
tions between computation and communication of the sjsterr 

Using these models, several sample routines of AIR3D package are exam- 
ined and tested. It may be noted that automatically generated codes are highly 
parallelized to provide maximize degree of parallelism, obtaining the speedup 
up to 28 on a 32-processor system. A comparison of parallel codes for both 
existing and proposed communication model, is performed and the correspond- 
ing expected speedup factors are obtained. The experimentation shows that 
the B-HIVE compiler produces more efficient codes than existing techniques. 

Working is progressing very well in completing the final phase of the com- 
piler. Numerous enhancements are needed to improve the capabilities of the 
parallelizing compiler. 
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CHAPTER 1 


INTRODUCTION 


Parallel computation is crucial in shortening the turnaround times for 
massive scientific computations, such as computer aided design applications, 
computational fluid dynamics, and weather forecastings. Numerous efforts 
have been made to increase the speedup for such algorithms on various parallel 
processors, such as vector processors [SwJ85], shared memory multiprocessors 
[EBS84], and private memory multiprocessors [Cat87]. The major effort in util- 
izing multiprocessor systems for computational fluid dynamics lies in the way 
of increasing the degree of parallel operations and cutting down the turnaround 
time that is extremely long on a uniprocessor machine. Navier-Stokes algo- 
rithm that requires, massive computations, is a fundamental simulation model 
for computat ional fluid dynamics needed in designing a high speed aircraft. 

Architectural innovation in multiprocessors has also encouraged the 
development, of ne * 'ools m exploring the capabilities of the parallel machines 
[Fly72|. Trad :..aJ.y coded programs are not directly suitable for the new 
architectures, an ' js /ieXv they must be reorganized or modified to utilize the 
power of the new i .achmes For this purpose, either we can develop a parallel- 
izing compiler that transforms a sequential source program into parallel code 
automatically, as described in [A1183, A1K85, FER84, KKP81, PKL80, LAM87, 
LKA88], or we could provide programmers some sort of parallel programming 
environment such as seen in new programming languages [Hoa78, Ahu86], and 
in the extended languages [Han77, KuS85, Sha86] so that the parallelism could 
be specified manually. 

The disadvantages of having new parallel language is the need for a new 
compiler and. the mandatory need for rewriting the whole program all over 
again for a new machine. 

We have concentrated our efforts in retargeting existing software. The 
advantages of having parallelizing compiler are its “user friendliness” and high 
reusability. The user friendliness means that a user can get the parallel codes 
without learning the new parallel language. Reusability implies that the exist- 
ing programs can be run on a new parallel architecture simply by recompiling 
the existing source code. The drawbacks of having a parallelizing compiler are 
the higher compiling cost (due to the incorporation of automatic parallelism 
detection algorithm), and a somewhat lower degree of parallelism (because of 
inherently sequential algorithms or programmer’s coding styles). 
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Most past efforts on parallelizing compilers have been concentrated on the 
loop structures. For example, parallelism in loops is extensively analyzed in 
[PoB87, PKP86], the granularity considered in [Cve87], and dynamic schedul- 
ing covered in [PoK87]. These efforts are mainly concentrated on the paralleliz- 
ing the codes for a shared memory multiprocessor system in which an interpro- 
cessor communication overhead is negligible. In a loosely coupled distributed 
memory multiprocessor system, however, communication overhead is large so 
that run time task distribution itself would require excessive amount of time. 
This limitation basically encouraged the use of the static scheduling in which 
the task allocation is done before run time. 

Navier-Stokes program has been restructured and tested under the static 
scheduling strategy on a two- VAX 11/780 based shared memory multiproces- 
sor, and the speed-up of 1.9 [EBS84] has been reported. This result is not too 
encouraging and has forced to look into possible utilization of distributed 
memory multiprocessors for computational fluid dynamics and other scientific 
computations. This report covers various strategies that have been employed 
in building a parallelizing compiler for a distributed memory multiprocessor 
environment and describes the computation models used for the AIR3D pack- 
age, a version of Navier-Stokes algorithm provided by the NASA. Only the 
static scheduling strategy is considered as the parallelizing compiler has been 
developed lor the ooselv coupled B-HIVE multiprocessor system [AAG86]. 
The B-HIVE multiprocessor system is a 24-node generalized hypercube based 
machine designed and built at the North Carolina State University. Each node 
in the B-HIVE system consists of a pair of processors, an application processor 
and a communication processor, which communicate with each other through a 
fast dual-port shared memory. 

In Chapter II, we describe the structure of the B-HIVE parallelizing FOR- 
TRAN compiler. This chapter also describes the parallel software model and 
communication model employed in the parallelizing compiler, and is accom- 
panied with few examples. In Chapter III, we consider several AIR3D routines 
extensively to show how the proposed software models and strategies are used 
in the B-HTVE compiler. We also show the parallel codes for several sample 
routines. Performance simulation and test results on some AIR3D subroutines 
with proposed model is given in Chapter IV. Finally, the current status of the 
project is summarized, and the future plans are also included. 



CHAPTER 2 


B-HIVE PARALLELIZING COMPILER AND SOFTWARE MODELS 


B-HIVE parallelizing compiler accepts a sequential code and produce a 
parallel code. It, automatically and interactively, detects parallelism of the 
source codes, determines the type of loops, allocates parallel tasks, and finally 
generate parallel codes. This chapter describes the parallelism detection and 
utilization strategies used by the B-HIVE parallelizing compiler. 

Figure 1 outlines seven phases in the compilation. The front end of the 
compiler includes lexical and syntactic analysis of the source codes written in 
FORTRAN. The second phase of the compiler reads syntactically verified 
source codes and builds a program tree in which nodes and arcs represent tasks 
and program control flows respectively. Source cader are divided into a set of 
tasks according to the program’s natural boundaries, such as loop bodies, com- 
parison bodies, subroutine, and basic blocks \ basic block includes a 
sequence of consecutive '-'Hlements without interior branching or stopping. 
Within each basic block, tin « xecution order of t»e statements is then carefully 
analyzed based on the data dependence relations between every pair of state 
ments. Defining weak ’ dependence relation between sets of statements by a 
precedence relation, such that allocation of every set of statements onto two 
different processors does not delay the completion time of a basic block, group- 
ing statements into a set of tasks will improve or at least not worsen the perfor- 
mance if every pair of tasks are weakly dependent on each other. In other 
words, by grouping “strongly” dependent statements into a task, such that a 
precedence relation between a pair of the statements delays execution if the two 
statements of every pair are allocated onto different processors. This property 
of grouping statements inside basic blocks provides another source of potential 
parallelism of programs besides the parallel loops. 

The potential parallelism of basic blocks is explored in the third phase of 
the compiling process, forming a hierarchy of tasks based on the data depen- 
dence relation between every pair of tasks. In a distributed memory multipro- 
cessor environment, a communication overhead is inevitable so that grouping 
strongly dependent statements into a task is crucial to eliminate excessive com- 
munication overhead. A task is represented by a “grain” which is basically a 
group of operations. Partitioning a basic block into several grains is character- 
ized as a granularity model. In a fine granularity model, each grain consists of 
only few operations. For instance, in an extreme case, such as a data flow com- 
putation, each grain contains one operation and all grains are concurrently 
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issued for better utilization of the resources. The drawback of this model is 
that it tends to initiate a lot of communication and synchronization traffic 
among the processors. Hence, this model is not appropriate for a distributed 
memory multiprocessor environment. Another extreme example is a coarse 
granularity model. Every grain has big chunk of computations, such as pro- 
cedure level seen in a distributed computing environment wherein the commun- 
ication overhead could be reduced to the minimum level at the expense of the 
degree of parallelism [Bab84]. Flexible granularity model is a compromise, 
resulting in the medium size grains. It begins exploring all potential parallelism 
of the statements within each basic block, producing a set of fine grains, and 
then regroups cohesive fine grains into medium grains, thereby decreasing the 
communication overhead while retaining the parallelism among the grains. 

In the fourth phase, every basic block is replaced with a block data depen- 
dence graph, and a global data dependence graph of the program tree is built 
to ensure proper code synchronization throughout the entire program. The 
information obtained in this phase is turned into communication primitives in 
the code synthesis: phase The next phase allocates all the tasks onto a limited 
number of processors Optimal allocation of m tasks to n processors is a well- 
known NP-complefe probh in [Sto77]. Thus, it is desirable to devise a heuristic 
algorithm which give- -i m ar optimal answer within a reasonable amount of 
time [Efe82]. The ilb>< ation phase takes into account the computation times of 
the tasks and the conmunn at ion costs among them. Communication is con- 
sidered according to the overlapped computation/communication model 
described below and in [LAM87]. Finally, the synthesis phase reorganizes the 
sequential codes and allocate them onto a number of processors inserting the 
communication primitives and ultimately producing a separate (cooperating) 
program for each processor. 
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Sequential Code 



Figure 1. The Structure of the B-HIVE Parallelizing Compiler. 


1. Program Division 

The program divider builds a program hierarchy tree for a given program 
into “non-leaf nodes” (that usually has child node) and “leaf node” (that has 
no child nodes). Control statements, such as loop header, comparison state- 
ments are sources of non-leaf nodes, and basic blocks become leaf nodes. For 
example, every DO-loop is considered as a non-leaf node, and then, every 
nested loop is represented as a parent-child relation. The algorithm is 
described in Figure 2. All program hierarchies shown in this report are built by 
using this algorithm. 
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input : Sequential Program 
output : Program Hierarchy Tree 

1. Create root node of the hierarchy tree with the name of the program. 

2. while not done 

2.1. Create a new child node for the current node 

2.2. repeat 

Put statement into the child node of 2.1 
until boundary statement is encountered. 

2.3 Case A: { a starting point of a structure } 

Create a new child node for the current node. 

Current node := new child node of the current node. 

2.4. Case B: { an ending point of a structure } 

Current node := parent node of the current node. 


Figure 2. Program Division Algorithm 


For a given program with n statements, the time complexity of the pro- 
gram division algorithm is O(n), since every statement in the given program is 
visited exactly once in the algorithm while the time needed to process a state- 
ment is constant 


2. Flexible Granularity Model 

Basic block partitioning begins with building a “block data dependence 
graph”, in which all potential parallelisms of a basic block is exposed. The 
problem with a fine grain graph is that it contains too many nodes, and too 
many communication arcs between nodes. To reduce the communication over- 
head, cohesive fine grains are grouped into medium grains, decreasing the 
number of arcs while retaining the parallelism among the grains. Cohesive fine 
grains are a set of connected nodes in a block dependence graph. In other 
words, there must be either a direct or an indirect precedence relation between 
every two nodes. 

Three precedence relations are considered in partitioning. The “output 
dependence” relation occurs when two statements update the same variable. 
This dependence relation can be easily solved by allocating the statements onto 
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different processors, where the interference is removed by the use of private 
memory, or by maintaining the original order of execution if they are allocated 
onto the same processor. Thus, output dependence is not an obstacle to paral- 
lelism. The “anti-dependence” represents the relation that a statement uses a 
value that the following statement modifies. This relation can be resolved by 
allocating each task onto different processors in which every processor reads the 
value into its own private memory before execution. This enables several pro- 
cessors to use and to modify concurrently without any conflicts. The “flow 
dependence” occurs when a statement produces a value that is used by a suc- 
cessive statement. Whenever a pair of statements that have flow dependence 
relation are allocated onto different processors, the interprocessor communica- 
tion is unavoidable, which implies excessive communication overhead. To 
prevent this, we need a more systematic mechanism to reduce the interproces- 
sor communication by grouping the statements along the flow dependence arc. 

A memory aliasing in a distributed memory multiprocessor requires more 
careful consideration for grain grouping. Assuming array elements A[«] and 
A[j\, such that i = j at run time, it is unknawir whether' r—./ at compile time. 
For example, given 

51 i = 

5 2 A[i] - 

5 3 read (j) 


s 4 A[jj = • • 



S 5 depends on either S 2 or ^4 since j is unknown at compile time. 

Let T p : S,- — S ; be a task set that has statements S,- and Sj , such that 5, 
must be preceded by Sj. Then partitioning into two tasks, such as ^ : 
Sj — 52 — S 5 and T 2 '• S 3 — S 4 causes confliction if T x and T 2 reside in the 
different processors, since S 5 must refer to the processor that has A[j] if j = i. 
Similar situation is observed among the tasks that retain anti dependence rela- 
tion. To solve this problem, a pair of array elements, that retain any pre- 
cedence relations should be grouped together. We denote dependence relation 
among array elements as “array dependence relation” through the paper. 

Flexible granularity model performs grain grouping, so as to eliminate 
excessive communication traffic while retaining all, or at least most, parallelism 
detected during the data dependence analysis. “Vertical partitioning” provides 
a way of grouping grains, in which every pair of fine grains has either direct or 
indirect precedence relations are fused together forming a medium grain. By 
labeling the dependence arc between every pair of grains with the correspond- 
ing communication cost, we can determine every path cost along the arcs. 
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Assuming as if we allocate the longest path onto a processor, the next longest 
path to anther processor, and so on, we obtain a number of parallel medium 
grains that have fine grains on every corresponding path. However, the task 
sets obtained solely by vertical partitioning do not promise optimum perfor- 
mance, as the intertask synchronization between pairs of tasks some times defer 
completion time compared to the case that the task pairs are allocated onto the 
same processor. To avoid this circumstance, we propose a versatile grain 
grouping algorithm based on “list scheduling” technique. 

List schedulings [CoG72, KaN84] are a class of implementable static 
scheduling methods in which tasks are assigned priorities and placed in a list 
ordered in descending magnitude of priority. Whenever executable tasks con- 
tend for processors, the selection of tasks to be immediately processed is done 
on the basis of priority with the higher priority tasks executable being assigned 
processors first. This characteristic of list schedulings tends to evenly distribute 
tasks over all processors based on the load balancing criterion. 

Our flexible granularity model solves both grain grouping and load balanc- 
ing problems. It begins with finding the longed, path" task in the vertical parti- 
tioning technique. We delete communication overhead within the longest path 
task, of which the cost implies actual computation tune of the task. Then 
confirm whether other path tasks increase the longest path task cost due to 
intertask communication between the long^i p ♦ * h ta^k and other one. If there 
is a task that defers the completion timt of tin longest path task, this task is 
merged to the longest path task until no more task affects the completion time 
of the longest path task. During the task merging process, grains are reordered 
according to the dependence relation between a pair of grains. We name the 
proposed algorithm “Critical Path with Coloring” (CP/C) algorithm and sum- 
marize it in Figure 3. 

The advantages of CP/C algorithm are: 

1) it can partition a basic block into a set of weakly dependent parallel tasks 
with 0(n 3 ); 

2) it represents a task with two timing values, the earliest finishing time, Ty 
and the latest starting time t, of the task, and with a color number that 
represents the task group number; 

3) it can simulate the near optimum completion time of a basic block on an 
infinite number of processors, where the completion time is determined by 
solely the longest path task cost; 

4) in the worst case, a basic block is assumed a task so that it does not allow 
excessive partitioning. 

Consider the example that has several flow dependence relations among state- 
ments with arbitrary execution costs, as shown in Figure 4. Flow dependence 
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relations are identified, such as S x S 2 S 3 , 5 4 -5 5 and S 2 — S 5 . Let depen- 
dence relation be replaced with same communication cost, t c . If communica- 
tion cost is enough small to simulate a shared memory multiprocessor architec- 
ture, for instance, t c ^400, task partitioning, such as T 1 :S 1 — S 2 ~S 3 and 
T 2 : S 4 ~S 5 is the optimum way to shorten the completion time. If communi- 
cation cost simulates a distributed memory architecture, for instance, t c = 1000, 
task partitioning, such as T 1 : Sy — S 2 - S 3 - S 5 and T 2 : S 4 is the optimum par- 
titioning. If t c >1000, all statements of the example becomes a single task since 
any partitionings worsen the result. 


input : Basic Block 
output : Parallel Task Set 

1. Build Array Dependence Graph, G 

2. Append Flow Dependence Arcs of non-array Symbols to G 

3. Label Dependence Arcs with Corresponding Communication Cost 

4. while not empty ( G ) do 

4.1. for ever} path of (7, G p do 

Costfp] V Grain Computation cost ^-^Communication cost 

4.2. Find tin lon*»<si path *„ 

^ in »* 

4.3. Klimmate ( onuiiui ication cost from G„ 

r m\x 

4.4. Put color number to every grain of C Pm> 

4.5. while not finished do 

Recalculate Cost[p] 

if task T, exists, such that 7\ mostly affects Cost[p] then 
merge T, to G Pmmx 
else set finished 

4.6. Put every grain on G Pm ^ into a Task set, T p 

4.7. Compute earliest finishing time and latest starting time of T p 

4.8. Delete G 0 from G 

P mix 


Figure 3. Critical Path with Coloring Algorithm. 
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S x (1000) 

X = • • ■ 

s 2 (200) 

Y = X+ • ■ 

S 3 (500) 

Z = X * Y * 

5 4 (700) 

V = • • • 

s 5 (100) 

W = Y + V 


(a) Basic Block 



Task Latest 
No Starting Time 

1 0 

2 600 


(b) Task Graph (t c = 100) 



Earliest 
Finishing Time 

1700 

1400 


Earliest 
Finishing Time 


1 800 
700 


(c) Task Graph (t c = 1000) 


Figure 4. Sample Basic block Partitioning based on CP/C Algorithm. 


Returning to the allocation problems, we have to allocate a number of task 
set onto a limited number of processors. Whenever we have enough processors 
to allocated every task set, we can allocate every task onto different processors 
as vertical partitioning did. However, if we have a fewer processors than the 
number of tasks, we have to combine multiple tasks onto a single processor in 
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an efficient way not to sacrifice the parallelism. 


3. Communication Model 

The purpose of scheduling is to allocate tasks onto a number of processors 
so as to achieve the optimum performance of the process. This goal can be 
accomplished by maximizing the utilization of available processors, while the 
communication and synchronization activities between processors is kept to a 
minimum. Intrinsically an efficient computation on a distributed memory mul- 
tiprocessor is dependent upon not only the degree of parallelism of the pro- 
grams but the ratio of computation cost over interprocessor communication 
overhead. The minimizing of interprocessor communication overhead and 
parallelization are two conflicting objectives in the allocation process. The 
methods based solely on the first objective tend to utilize only few processors 
for the sake of reducing communication overhead. On the other hand, methods 
based solely on parallelization tend to increase the interprocessor communica- 
tion overhead. One way to compromise these conflicting objectives is to overlap 
computations and communications in the computation. 

The computation model used in [Cam85, CLE80, Lo83, I'atM, SaH86] 
have assumed that t h« outgoing precedence data are initiated at the end of the 
computation, and the u<< -on. .rig data have to be received prior to the computa- 
tion. We will T'fei to this view Model A. The assumptions encountered in 
model A may not maximize the computation / communication overlap when 
there is more than one operation in the task, as in medium grain tasks. 
Instead, in the proposed communication model, denoting Model B, all com- 
munication activities of each task are accurately represented. The times that a 
task produces outgoing data and the times it can proceed without the incoming 
data are labeled in the model. Assuming the communication overhead be lesser 
than the summation of the computation costs of parallelly executable portions 
of the tasks, Model B produces better parallel codes than does Model A , as 
indicated in Figure 5. 
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Processor 1 
X = ... 

Y = X+ ... 

Z = ... 

send Y to processor 2 
Processor 2 

receive Y from processor 1 
V= ... 
w = Y + V 

(a) Model A 

Figure 5. Expected Parallel Codes 


Processor 1 
X= ... 

Y = X+ ... 

send Y to processor 2 
Z = ... 

Processor 2 

V = ... 

receive Y from processor 1 
W = Y + V 

(b) Model B 

on the two Commui'ic ation models. 


In Model A . I’roressor 1 sends Y upon completion <.»! S,, and Processor 2 
receives Y h. f. -< performing 5 4 so that no overlapped oj- ration is allowed. In 
Model B, on f t < >»her hand, Y is sent as Processor 1 has updated it and is 
expected at the moment Processor 2 requests. Thus, Model B provides over- 
lapped operations of communication between Processor 1 and Processor 2, and 
computation of S :i and S 4 . 

Possibility of computation / communication overlap is examined by calcu- 
lation the “benefit factor”, which is a cost difference between parallel portions 
of the task computation costs and the communication costs. If the benefit fac- 
tor of a pair of tasks is positive, task allocation onto different processors will 
achieve better performance than that onto the same processor. Overlap opera- 
tion provides efficient ways of computation not only to a basic block, but also 
to non-parallelized loop, as DOACR does in [PoB87] and other types of tasks, 
such as DOALL loops and comparison blocks. We expand overlapped execution 
to loops along the examples in the next section. 


4. Loop Detection and Allocation 

Fundamental source of extensive parallelism of programs is the loops. In 
order to automate parallelism detection of loops, the compiler must look at the 
data dependence relations in the loops. If a loop contains array data then the 
data dependencies are somewhat more difficult to solve. If this is the case, an 
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array dependence relation has to be considered, that is more complex to process 
than flow dependence relations in the scalar mode. In this chapter we intro- 
duce techniques to detect loop types, such as DOALL, DOACR, and DOSEQ. 
DOALL is a loop such that every iteration has no precedence relations so that 
every iteration can begin simultaneously. Thus, DOALL loops can be distri- 
buted onto processors in an arbitrary manner. DOACR is a loop type such that 
iterations are weakly dependent so that overlapped operation may shorten the 
completion time of the loop. DOSEQ type is similar to DOACR type but for in 
which every iteration is strongly dependent. So distributing iterations of a 
DOSEQ loop does not necessarily improve the loop performance, and in most 
cases, they even worsen the performance due to interprocessor communication 
overhead. 

Every loop statement defines loop variable, lower bound, upper bound and 
step size. Given FORTRAN loop body 

DO 10 I = LB, UB, ST 
10 CONTINUE 

I,LB,UB and ST be a loop variable, a lower bound, a upper bound, and a step 
size, respectively. The number of iterations is then given by 

UB-LB + 1 
*"[ ST 

Then, the upper bound is normalized as 

UB norm = min ST + LB, Ub). 

We will assume that an upper bound UB , in this discussion, has been normal- 
ized for simplicity. 

We define “array index vector” be an integer number, that specifies which 
nested loop directly determines the indices of an array. If a loop index variable 
is used as an array index within a loop, array index vector of the array index 
position is set to “1” to indicate the array location is continuously changed per 
every iteration. We also define “displacement” to be a displacement of the 
array location in the direction of loop index variable. For the example shown 
below, Dj and Dj are the displacements for the loop variables I and J, respec- 
tively. 

L x DO 100 I = 1,10 
L 2 DO 100 J = 1, 10 

A(I +D { ) = B(J +Dj) 
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100 CONTINUE 

Seen in L 2 , array index vector of A is set to ‘O’ since the array element of A is' 
unchanged during the execution of L 2 . When the loop L x is the encountered, 
the array A’s location is dependent upon the loop variable. To identify the 
location of A is closely related ti the loop variable the array index vector of A 
is set to T. Thus assigning an array index vector to ‘1’ implies that every loop 
iteration modifies the location of the array element. Similarly, B's index vector 
to Ly is ‘2’, that implies the array B's location is modified at the next immedi- 
ate inner loop bodies. Seen in L 2 , B's array vector is set to T as is A’s array 
index vector to L x . 

After data dependence relation for the loop body has been constructed and 
array index vectors and displacements have been computed, the compiler is 
ready to test the loop types. It may perform some transformations on the 
source codes to increase parallelism of the loops. So far, the B-HIVE compiler 
does not include any of these transformations during the compilation, and only 
can detect loop types for the given structure of the program 

The loop type detection algorithm firstly performs a) DOALL type detec- 
tion algorithm that isolates DOALL loops, and then in the second pass, b) 
DOSEQ type detection algorithm that isolates DOSEQ loops from a pool of 
DOACR loops. Let iMsp I 4 ,A ) be the displacement of an array A of which k- th 
dimensional index express.on is expr , and Vect(A ,Lvar ,k) be the array index 
vector of A at k- th dimension for an loop index variable Lvar. Using the nota- 
tions defined above the following relations gives criteria to detect DOALL loops 
on a distributed memory multiprocessor environment: 

1) A loop is DOALL if none of arrays and scalar symbols are used and 
defined within the loop. 

2) A loop is DOALL if Disp(A d ,k)<Disp(A ri ,k) and 
Vcct(A d ,Lvar ,k)= Vect(A u ,Lvar ,k)= 1 for every pair of A d and A tt , where 
A d is an array element defined and A u is an array reference. 

Loop blocks that do not satisfy above two condition are assumed DOACR, and 
are left for DOSEQ loop detection. In this report we give an example to test 
the DOALL conditions instead of verifying the two conditions. The sample 
loop below is clearly a DOALL according to the condition 1). 

L x DO 100 I =1,10 

A(I+1) = C(I-l) 

100 CONTINUE 
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Consider the following loop body that is known to be parallelizable at a 
shared memory multiprocessor or pipeline processor environment: 

L x DO 100 I = 1,10 

S x A(I+1) = C(I) 

S 2 B(I) = A(I) x 2 

100 CONTINUE 

L x is not DO ALL in a distributed memory multiprocessor environment, 
unsatisfying the condition 2) since 

Vect(A d ,1, 1)= Vect(A u ,I,l)=l 

and 

Dtsp(A d ,l)=l>Disp(A u ,l) = 0 . 

where A d is A(/+l) and A u is A (I). The determination is correct since distri- 
buting iterations in a card shuffling fashion encounters communication over- 
head so as to send A[I+ 1) to the processor that executes the successive itera- 
tion and to receive A (/) from the predecessor. 

Loops that do not satisfy the above two conditions are considered as 
DOACR or DOSEQ type loops. The DOSEQ isolation is performed based on 
the communication mode! shown in Section 3. Assuming that every iteration of 
the loops be distributed evenly onto the processors, a set of variables of an 
iteration have to be sent to processors that need for successive iterations in 
them. Let t u (<J>) be the earliest time that a variable <t> has to be received in 
processors, and T^(d>) be the latest time that a variable is ready to send in a 
processor. Then the “benefit factor” of distributed computation BF for is 

BFq = t u ($>)+ t £ . - r d ($>)■ - t c 

where T£. is the execution cost of the loop body L,, and t c is the communica- 
tion cost. Thus, in the worst case, the benefit factor, BF ^ is 

BF min =mm(BF q^BF <j, 2 , • • • ,BF <j> n ) 

If BF ^ is positive, distributing iterations onto processors will provide better 
performance than allocating the loop on a few processors. Otherwise, alloca- 
tion on a single processor is preferable. Using the example above, array ele- 
ment A(*) is ready at the time S 2 has done and is received at the time S x 
starts. Thus, BF 0, and as a result, L x is a DOSEQ loop. We can formal- 
ize the loop type detection algorithm in Figure 6. 
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input : Loop body 

output : Loop Type, either DOALL or DOACR 

{ for a given loop body, L,-, split DOALL and DOACR } 
Loop_Type (L,) :=DOALL; 
while not done do 

for every symbol <f>, such that <1> U and exist do 
{ where <1^ and <I> U are a variable <I> defined and used within Z,,} 
if d> is scalar variable then 
Loop_Type(L, ) :=DOACR; 
done := true; 
else { if array then } 
for every dimension, k do 
if ( Vect(<t> u ,L i var,k)= Vect(^> d ,L i var ,k)= 1) 
and ( Disp(& d ,k)>Disp( f t> u ,k ) then 
Loop_Type(L,) :=DOACR; 
done := true; 
else { skip } 

(a) DOALL Type Detection Algorithm 


input : DOACR loop 

output : Loop Type, either DOACR or DOSEQ 

{ for a given loop body, L if split DOACR and DOSEQ } 
while not done do 

for every symbol <£>, such that d and <!> u exist do 
{ where and $> tt are a variable 3> defined and used within L, } 
Compute BF(<t>)] 
if t c then 

Loop_Type(f/,) :=DOSEQ; 
done := true; 

else { continue for next symbol } 

(b) DOSEQ Type Split Algorithm 


Figure 6. Loop Type Detection Algorithm 
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Parallel loop allocation on a distributed memory multiprocessor should 
consider a parallelizing overhead caused either by a number of iterations or by 
a storage allocation. A large number of iterations is preferable as many as pos- 
sible. If loop bounds are known at compile time, the compiler can determine 
the possible allocations based on a economic analysis. If a number of iterations 
of a loop block is quite small, relatively large communication overhead will 
force the loop block to be allocated to a single processor. If a loop bound is not 
known at compile time, the economic analysis fails, and one should select possi- 
ble parallelization in manual or interactively. The B-HIVE compiler assumes 
unknown loop bounds be enough big to utilize the processors, unless compile 
directive switch “SEQ” is preceded to a loop block so as to force the loop to a 
DOSEQ type. 

Loop blocks deal with array elements in the loop body, and in most cases, 
array elements are linearly correspondent to the loop variable. That means the 
array elements have to be distributed along the loop partitioning strategies 
before loop execution. In order to prevent unnecessary waiting time to receive 
the immediately requested variables, sending the variables as soon as they are 
ready will reduce the waiting time. Similarly after loop execution updated 
arrays have to be gathered at the compile time known processor for the succes- 
sive references. We define the location of an array as "origin processor” of the 
array. 

Parallel loop allocations always encounter two communication overheads, 
before and after a loop execution. For a DOaLL loop given 

L x DO 100 I =1,10 

A(I-l) = B(I+2) 

100 CONTINUE 

B has to be distributed before L x begins execution, and A has to be sent back 
to the origin processor of A after L t has finished. 

Two iteration distributing strategies are possible. We can distribute every 
iteration onto processors in a card shuffling fashion, or allocate several contigu- 
ous iterations at one processor and then next several onto the others until every 
iteration is assigned. The first strategy is not powerful in DOALL loop imple- 
mentation, since the origin processor (assuming all array variables reside at the 
same processor) and others may have almost same number of iterations 
although either the origin processor or the other processors do not start at the 
same time. Assuming that a processor that performs sending variables can do 
next executions as soon as it initiates send operations, and a processor that ini- 
tiates receiving operation must wait until the variables are received, the origin 
processor can share more iterations while the others are receiving variables. 
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The compensation can be realized under the first strategy, however, the origin 
processor will have three loops; one for to compensate communication before 
loop execution, one for after loop execution, and the same number of iterations 
shared on processors. This will cause difficulty in synthesizing loops. 

Another drawback of the first strategy allocating DOALL loops can be 
found from the array storage usages. For example, consider a sample routine 
that uses and updates two contiguous array elements, respectively, as indicated 
below. 

Lj DO 100 I =1,10 
A(I) = B(I) 

A(I-l) = B(I+1) 

100 CONTINUE 


If we use the first strategy, every iteration consumes two disjoint array ele- 
ments B(I) and B(I+ 1), and produces two array elements A(f) and A(/~l). 
Thus, every processor needs 20 array elements. On the other hand, using the 
latter strategy, each processor needs only 12 array elements since some of array 
elements are used and updated repeatedly. Consequently, the latter strategy is 
preferable to DOALL loops DOACR loops can shorten the completion time of 
loops to the lowest cost only when every iteration is evenly distributed onto 
processors, since a loop is DOACR if and only if the loop is not DOALL and 
the benefit factor of distributed operations is positive. In our implementation 
the latter strategy is applied to DOALL loops, and the first one is used for 
DOACR loops. 

Consider a DOALL loop whose loop body execution cost is t^. Then 
the extra iterations that the origin processor has to share for compensation is 

2x t , 

N L 

l 'comp 


and the number of iterations to be shared on processors is 




N-N, 


comp 


Pe 


where N is the total number of iterations and Pe is the number of available 
processors. The total number of iterations that an origin processor should do is 
N eomp + N # 4 . Then the loop bounds at a processor whose identification number 
is PEno are determined by 


LB PEno = LB + (N', mp + (PEno - PEorg ) X JV Juj )x ST 


and 
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UBpEno - min ( UB ,LB dlst +{N comp + ( PEno - PEorg + 1) x N dist -l)xST) 

where PEorg is the identification number of the origin processor, and ST is a 
step size of the loop. 

Communication between a pair of processors is a distinct characteristics of 
loosely coupled distributed memory multiprocessor architectures. The informa- 
tion is directly passed through either a message passing or a circuit switching. 
In any cases, a communication channel set up time is relatively expensive com- 
pared to the actual communication cost of a unit data transfer. If a large 
number of data is to transfer, and a new communication channel has to be rein- 
itialized, communication overhead will increase proportional to the number of 
data to transfer. If we send a block of data through the same channel that was 
used, we can eliminate the communication channel set up time. To realize 
block transfer operations, we define two array transfer primitives by 

SEND ( A( • • • I • • • ) I = La, Ua, ST, Pto ) 

and 

RECV ( A( • • • I • • ) I = La, Ua, ST, Pfr ) 

where La and Ua are a lower bound and an upper bound of the block data 
referring array A , and Pto and Pfr are processor identification numbers to be 
sent and to be received, respectively. Computing La and Ua are somewhat 
similar to calculating the loop indices LB, UB. When an origin processor ini- 
tiates distributing an array A , the two bounds are given by 

La = LB + m'm{Disp{A ul ,k), ■ ■ • ,Disp{A un ,k )) 

and 

Ua= UB + max{Disp(A vV k), • • • ,Disp(A un ,k)) 

where A UI is a array referenced in the loop body. When an origin processor col- 
lects an array A to restore in it, the two bounds are determined by 

La = LB + min{Disp{A dv k), ■ ■ ■ ,Disp(A dm ,k )) 

and 

Ua = UB + ma.x(Disp(A dl ,k), • ■ ■ ,Disp(A dm ,k )). 

Communication primitives must be carefully inserted so as to avoid “dead 
lock” caused by a non-pair send-receive operations. To prevent dead lock, we 
add conditional statements to match the pair of communications as 
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IF ( • • • ) THEN 

send or receive primitives 

END IF 

Consider the general format of the expected parallel codes for DO ALL loops, as 
indicated below. 

A ( • • • ) = ••• 

B (...)= ... 

L, DO 100 I = LB, UB, ST 

Sj ••• = A ( • • • I • • • ) 

S k B( ■ • ■ I •••)=•• • 

100 CONTINUE 

(a) Sequential DOALL Type Loop 

A ( ' )=- • 

{ Distribute array A to PE t through PE m } 

Compute array A ’s indices, La, Ua for PE[ ; 

SEND ( A( • • • I • • ' ) I = La, Ua, ST, PE t ) 

Compute array A ’s indices, La, Ua for PE m ; 

SEND ( A( • • • I • • ) I = La, Ua, ST, PE m ) 

B("-)= 

{ Start loop } 

Calculate LB org and UB org ; 

DO 100 I = LB org , UB org , ST 

Sj ■■■ = A( • • • I • - • ) 

S t B( • • • I •••)=•• • 
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100 CONTINUE 

{ Restore Scalar values } 

Calculate PE iast , in which the last iteration is performed; 
RECV ( I, PE last ); 

{ Restore Array B by receiving from processors } 
Calculate array fl’s indices, La, Ua for PE t ) 

RECV ( B( • ■ • I • • • ) I = La, Ua, ST, PE t ) 

Calculate array B’s indices, La, Ua for PE m ; 

RECV f Bf • • • I • • • ) I = La, Ua, ST, PE m ) 


(b) Parallel code in an origin processor 


Compute LB di ,, and UB disi ; 

DO 100 I = LB dllt , UB disl , ST 

IP 1 (first iteration) THEN 

Calculate array A ’s indices, La, Ua; 

RECV ( A( ■ • • I • • • ) I = La, Ua, ST, PEorg ) 

ENDfF 

Sj ••• = A ( - * • I • • • ) 

S k B( ••• I •••)=• • 

100 CONTINUE 

{ Pass Updated Scalar to origin processor } 

EF ( last iteration ) THEN 
SEND ( I, PEorg ) 

END IF 

{ Pass Updated Array to PEorg } 

Calculate array B’s indices, La, Ua for origin processor; 
RECV ( B( • • • I • • • ) I = La, Ua, ST, PEorg ) 


(c) Parallel code in non-origin processors 
Figure 7. Expected Parallel codes for DOALL Loops. 
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A boolean expression, first iteration ensures one receiving operation during 
loop iterations by checking whether / = LB dist . Similarly, last iteration 
ensures one send operation after completion of the entire loop to restore the 
scalars that are updated during the iterations. 

Several loops sometimes form a nested loops, in which either DOALLs or 
DOACRs are included. If loop size is not known at compile time, the compiler 
chooses an outermost DOALL loop as a parallelizable loops. As seen Figure 7, 
all processors share the iterations for the same amount of time unless the 
number of iterations is small enough to partition onto few processors. Thus the 
compiler assumes all processors have their own iterations to do, and this 
assumption prohibits further parallelization of the loop body that every proces- 
sor has. Consequently, the loop body of DOALL loops are assumed sequential 
codes. We will show the actual results of nested DOALL loops in the next 
chapter. 

Iteration distributing strategy in a card shuffling fashion is used for 
DOACR loop allocation. The expected parallel codes for given DOACR loops 
are similar to that of DOALL types except the variables that are used at the 
following iteration are sent to processors they need and that are received from 
the predecessors before they are needed. During DOACR loop synthesis, the 
compiler does not need to change the upper bound of the loops. The step size 
has to be modified to STxPe as does card shuffling, and the lower bound has 
to be normalized to correct the initial index values in every processor. 

Synthesizing the codes for variables that are “reference-only’ arrays are 
the same as DOALL loops, such that reference-only arrays are the ones whose 
elements are referenced but not redefined in the loop body. Using the notations 
defined for DOALL loops, we can formalize the expected parallel codes for a 
given example, as indicated in Figure 8. 
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A( • 1 ' ) — • • • 

Li DO 100 I = LB, UB, ST 

S p 

S n A( • • • l+Dj • • • ) = • • • 

S m • • • = A( • • ■ I • • • ) 

100 CONTINUE 

(a) DOACR loop code 

A (...)= ... 

{ Code for reference-only array distribution is identical to 
DOALL loops } 

Li DO 100 1 = LB, UB, Pe xST 

s p ■■■ 

S„ A( • • • 1+ D, •■•)=•• ■ 

IF (NOT last iteration ) THEN 

SEND ( A{ • • • I +Dj ■ ■ ■), modiPEno+Df)) 

END 

IF (NOT first iteration) THEN 

RECV ( A( • • - I • • •)> mod{Pe + PEno — D r )) 
END IF 

S m ... =A(---I---) 

S e 

100 CONTINUE 

{ Restore scalars } 

Calculate PE (ait , in which the last iteration is performed; 
RECV (I ,PE l<ut ) 

{ Restore A } 

calculate A ’s indices, La, Ua for PE t 
RECV ( A( • • • I • • • ) I=La, Ua ,Pe , PE, ) 

calculate A’s indices, La, Ua for PE m 
RECV ( A( • • • I • • • ) I=La, Ua ,Pe , PE m ) 


(b) Parallel code in an origin processor 
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L x DO 100 I = LB+ mod (Pe + PEno - PEorg), UB, PeX ST 

s p ■■■ 

S n A( • • • I +Dj • ■ • ) = 

IF (NOT last iteration) THEN 

SEND ( A( * • • I ■ • • ), mod(PEno+Dj)) 

END 

IF ( f irst iteration) THEN 

RECV ( A( • • • I • • • ), PEorg ) 

ELSE 

RECV ( A( • • • I • • • ), mod{Pe + PEno -£>,)) 
END IF 

5 m • • • = A( • * • I • • • ) 

5 e 

100 CONTINUE 

{ Restore scalars } 

(last iteration) THEN 
SEND ( I, PEorg ) 

END IF 
{ Restore A } 

calculate A ’s indices, La, Ua for PEno 
SEND ( A( • • • I • ■ • ) I=La, Ua ,Pe , PEno ) 

(c) Parallel code in non-origin processors 
Figure 8. Expected Parallel codes for DOACR Loops. 


We will consider an actual DOACR loop routine in the next chapter. 


CHAPTER 3 


CASE STUDY : AIRSD 


The AIR3D program [PuS78] is a huge program consisting of a large 
number of statements with several subroutine calls, numerous variables and 
many data dependencies between statements ought to be considered. We do 
not go through the details of routines of the AIR3D program package, since our 
discussion in this report is to show how a parallelizing compiler can automati- 
cally transform sequential codes into parallel versions. 

Before doing the actual restructuring or synthesizing the codes, few con- 
trol flow statements need to be rearranged for simplicity. The AIR3D program 
extensively uses branch statements in which the condition is determined at run 
time. Thus, the number of statements to be executed would depend upon the 
value of the input data. Furthermore, compilers cannot follow the control 
flows except for simple cases, since in many cases there are cyclic control flows 
between statements. We therefore replace branch statements into correspond- 
ing comparison statements, in which the number of statements within a com- 
parison block is deterministic. This replacement make the speedup analysis 
easier. 

Several branch type blocks exist in the package. Denoting a “forward” 
branch such that all branch targets always follow branch instructions, and a 
“backward” branch such that branch targets locate before the branch instruc- 
tion, we describe a way to restructure branch blocks into corresponding com- 
parison blocks. Common branch structures in AIR3D package are ones having 

‘IF ( B-expr ) GOTO 5,’ 

where 5, is a branch target to jump if B — expr is true. Branch blocks shown in 
XXM, YYM and ZZM routines use this structure. 

A backward branch block, in many cases, simulates a loop, similar to 
“while” structures in Pascal. However, it is hard to restructure into a loop, 
since there are many uncertainties to determine the loop types. In the experi- 
ments, we do not consider restructuring backward branch blocks. Another 
branch structure is to use three branch targets as 

‘IF ( B-expr ) S v S 2 , S 3 ’, 

where S lf S 2 , S 3 are the branch targets among which one of three locations is 
active according to B - expr condition. If the instruction is forward branch, it 
can be restructured similar to ‘GOTO’ types. We show an example of ‘GOTO’ 
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type forward branch restructuring in Figure 9. 


W{B~expr{) GOTO S bl 
lF{B-expr 2 ) GOTO S b2 

IF (B- expr n ) GOTO S bn 
Sbo 

goto s bc 

Sbl 

GOTO S bc 

$b 2 


GOTO S bc 
Sbn 

S bc CONTINUE 


(a) A Brain h Block with n Branches 


IF(.NOT.(B-cxpr 1 ).AND.(.NOT.(B-expr 2 ).AND. • • • 
.AND. (.NOT. (5 - expr n )) THEN 
S b o 

ELSE IFfflSj) THEN 

S bl 

ELSE IF(J3S 2 ) THEN 
S b 2 


ELSE 

Sbn 

S bc END IF 

(b) Corresponding Comparison block restructured from (a) 


Figure 9. Forward branch block restructuring. 
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1. Sample Basic Block : FLUXVE 

Among several routines of the AIR3D program, AMATRX , DEBUG , and 
FLUXVE are the routines that have few basic blocks (refer to Appendix A). A 
basic block is a parallelizing source by the flexible granularity model associated 
with the communication model. We choose FLUXVE as a sample routine. 
Applying the vertical partitioning and the associated scheduling algorithm, we 
can detect parallelism of the routine so that multiprocessor architectures could 
be utilized. 

Subroutine callers pass arguments through a processor that activates callee 
routines. We assign a processor, say ‘1’ be the origin processor of the caller’s 
arguments. Thus all undefined values referenced are assumed to be in the pro- 
cessor ‘1’. Consider the actual codes shown in Appendix A-a). J, KL , R 1, etc 
of the first statement are are thought to be in PE l . 

Theoretically, all the disjoint medium grain task sets can be allocated to 
different processors. However, due to a limited number of processors, multiple 
task set have to be allocated to the same processor. To realize medium grain 
task allocation, we begin to allocate the longest path task and earliest finishing 
time task first. During the allocation a “processor time space table” is main- 
tained, which gives the information such as when a processor is ready to accept 
tasks. The processor finding procedure searches the processor time space table 
to choose the “best fitting' processor, in which the time space gap becomes 
minimum. In this way the task can start by the latest starting time of the task 
or the task can start as soon as possible if it cannot start at its desired time. If 
there are more than two tasks whose path costs are the same, then the earliest 
finishing time task is chosen for allocation. If there is a time space, then an 
unallocated small task can be inserted (and can be finished within the time 
space), prior to the longest path task allocation. 

This strategy simulates a statement reordering, which is a technique to 
change the order of statements retaining the same result. Since a pair of tasks 
can be allocated onto different processors until their time boundaries are 
preserved, allocating small tasks that can finish earlier can be allocated before 
the task that starts after the small task has been finished. We formalize the 
basic block allocation algorithm, as below. 
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1. Sort tasks in descending order of the task cost; 

2. while task pool is not empty do 

2.1. Choose the longest task, T„ from the task pool; 

H m ax 

2.2. Find the best fitting processor, P Pmxx for T Pm ^; 

2.3. while exist do 

Check time space gap of P Pmxx , 

Choose the longest task, T p > that can fit in the space gap 
and that can finish within the space gap; 
if T p ' exists then 

allocate T„ ■ onto P„ ; 

r P max 2 * * * * 7 

delete T p > from the task pool; 
else reset exist; 

2.4. allocate T„ onto P_ ; 

/'max P max 7 

Figure 10. Basic Block Task Allocation Algorithm. 


In the experiments, we use four identical processors to allocate tasks. We 
can change the number of processors interactively during the synthesis phase. 
We show the synthesized parallel codes of FLUXVE routine in Appendix A-b). 


2. Sample DOALL Loop : XXM 

A parallel loop allocation is a key issues in the loop implementation. As 
we have described in Chapter 2, DOALL loops are free from finding processors 

during the allocation since virtually all processors share the loops. Every itera- 
tion is shared based on the loop body cost and the communication overhead 
needed to distribute arrays. 

Among several routines in AIR3D package, XXM, YYM, ZZM and 

DIFFER have one or several DOALL loops within the routines. Before restruc- 
turing the sample codes, the branch blocks are replaced with comparison 
blocks, and then loop type detection algorithm determines as is DOALL 
through the algorithm shown in Figure 6. Using the algorithm of Figure 6, 
XXM is seen to be DOALL type and its loop body cost is 2652 time units. 1 The 

'Timing results used in this report are the arbitrary values, not of the real machines. The time for 
one floating point multiplication is assumed SO unit times. 




29 


loop body cost is an average execution cost of one iteration, which realizes all 
comparison blocks are evenly hit during the execution. Assuming the commun- 
ication overhead be 1000 time units, 2 the origin processor will execute only one 
more iteration than non-origin processors since the time space for one iteration 
provides enough time space for non-origin processors to receive necessary vari- 
ables and to restore the updated values in the origin processor upon completion 
of the loop. 

The DOALL loop allocation algorithm is given in Figure 11. The algo- 
rithm firstly finds the origin processor that will start the first iteration and are 
responsible for distributing the arrays. The origin processor is easily deter- 
mined by searching processors that reside the task color number of the loop, 
where the task color number was assigned during the task partitioning. Then 
it synthesizes the loop bounds and loop header. Every statement in the loop 
body is allocated to every processor one by one by adding communication prim- 
itives and associated comparison statements. When the loop body encounters 
the loop label, it begins restoring the arrays and scalars in the origin processor. 
We show the synthesized parallel code of XXM , YYM, and ZZM in Appendices 
B, C, and D, respectively. 


s 1000 of communication overhead simulates a hypercube environment, such as the iPSC machine. 
Compared to the cost of multiplication (80), the communication cost is fairly large enough to simulate 
communications in a loosely coupled distributed multiprocessor environment. 
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input : DOALL loop 
output : Parallel DO Loop 

1. Find the origin processor by task color number; 

2. Append loop bound onto processors; 

3. Put ‘DO xxx L var = LB ,UB ,ST’; 

4. for every statement, 5, do 

4.1. while used variables are not available in a processor PEno do 

if processor is PEorg then 
Put send primitives; 
if processor is PEno then 

Append ‘IF ( first iteration) THEN’; 

Put recv primitives; 

Append ‘END IF’ 

4.2. Append 5, ; 

{ Restore defined variables to PEorg } 

5. while defined variables exist in a processor PEno do 

5.1 if processor is PEorg then 

Put recv primitives; 

5.2 if processor is PEno then 

Append ‘IF ( last iteration) THEN’; 

Put send primitives; 

Append ‘ENDIF’; 

Figure 11. DOALL Loop Allocation Algorithm 


The drawback of DOALL loop allocation in Figure 11 is that it may not 
fully utilize the processors if the number of iterations is not large enough to be 
shared by every processor. If this is the case, then fewer processors will execute 
iterations although loop body could be parallelized further than that can be 
detected as per the flexible granularity model). For a partially shown DOALL 
loop of RHS routine given below, assume that KMAX iterations are distributed 
onto a Pe -processor system. With KMAX = Pe , each processor performs one 
iteration. The entire loop body is put into one processor and then, the comple- 
tion time of the loop would be the loop body cost. However, if Pe ^2 x KMAX, 
every iteration of L ! can assign KMAX processors, and the inner loop body can 
use another processor to shorten the completion time of the loop body. In this 
case, better speedup can be achieved than when KMAX= Pe . 
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L x DO 10 K = 1,KMAX 

KL = (L-1)*ND+K 
Rl = YY(K,1) 

R2 = YY(K,2) 

R3 = YY(K,3) 

R4 = YY(K,4) 

CALL FLUXVE(J,KL,R1,R2,R3,R4) 
L 2 DO 12 N = 1,5 

12 F(K,N) = FV(N) 

10 CONTINUE 


(a) A partial code from RHS routine 



(b) Expected Parallel code when two processor performs a loop body 


Figure 12. More Speedup can be achieved by Loop Body Partitioning. 


We can designate processors manually at compile time, allowing a number 
of processors to execute a loop body together. Several processors then, can 
share the loop body parallelism. In most cases, however, proposed DOALL loop 
allocation algorithm is good enough as the number of iterations is generally 
larger than the number of processors. In this report, we do not consider loop 
body parallelization. 


3. Sample DOARC Loop : GRID 

Let us consider a sample DOACR loop of GRID routine. The variables N 
and array G have to be passed from one processor to the successive processors 
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that perform the following iterations. Assume the origin processor be PE X in 
the 4-processor system. The processor 1 passes N to the processor 2 upon com- 
pletion of Si, and performs S 2 and S 3 . Upon completion of S 3 the processor 1 
also transfer G{\,J) to the processor 2. While the processor 1 performs S 2 and 
S 3 , the processor 2 has to wait for N to be received from the processor 1. As 
soon as N has arrived, the processor 2 begins doing S v Similarly the processor 
2 sends N to the processor 3 upon completion of and G(J) upon completion 
of S 3 , respectively. By replicating these processes, every iteration is performed 
in a “software-pipelining” fashion. Software-pipelining operation is advanta- 
geous only when the communication overhead is less than the overlapped com- 
putation cost. The example given in Figure 13 a) is DOACR while communica- 
tion cost is less than 810 unit time as seen in Figure 13 b). The expected paral- 
lel codes are given in Figure 13 c). 



L x DO 912 J = NB11.NB2 


(22) 

Si 

N = N+l 

(788) 

S 2 

GN = (1.0+EPS/SQRT(FLOAT(N+4))) 

(104) 

£3 

G(1,J) = G(1,J-1)+DX*GN 

912 

CONTINUE 


(a) A Sample DOACR Code from GRID routine 


P 

i 



(b) Communication between processors 
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C ALLPE IS TOTAL NUMBER OF PROCESSORS 
C MYPE IS MY PROCESSOR ID NUMBER 
C PEORG IS THE PROCESSOR THAT INITIATES LOOP 
C MZZ, LZZ, LQQ ARE TEMP. VALUES 
MZZ = NB2 
LZZ = NB11 

LQQ = MOD (ALLPE + LZZ + MYPE - PEORG) 

C BEGIN OF LOOP 

DO 912 J = LQQ, NB2, ALLPE 
IF (J.NE.LZZ) THEN 

RECV (N, MOD(ALLPE + MYPE -1)) 

END 

N = N+l 

IF (J.LT.MZZ) THEN 

SEND (N, MOD(MYPE + 1)) 

END IF 

GN = (1.0 +EPS/SQRT(FLOAT(N+4)))**N 
IF (J.NE.LZZ) THEN 

RECV (G(1,J-1), MOD(ALLPE + M\TE -1)) 

END 

0(1, J) - G(1,J-1)+DX*GN 
IF (J.L I MZZ) THEN 

SEND (G(1,J), MOD(MYPE + 1)) 

END IF 

912 CONTINUE 

(c) Expected Parallel codes. 


Figure 13. DOACR Loop Allocation and Code Synthesis. 


CHAPTER 4 


PERFORMANCE EVALUATION 


We have implemented the major portion of the B-HIVE parallelizing com- 
piler. The compiler accepts input programs written in FORTRAN, along with 
the system configuration, produces an allocation scheduling list, and finally 
produces separate program for each processor. We did not implement a 
software package to analyze precisely the timing result of the routines. How- 
ever, the compiler gives timing information during the scheduling phase and 
the code synthesis phase. Since the static allocation distributes the tasks based 
on the processor time space table, the time information in the table gives data 
for precise evaluation. However, it cannot fully support timing issues. For 
example, DOALL loop allocation algorithm does not consider how many itera- 
tions the loop should perform, and it assumes one iteration be assigned onto 
one processor except the origin processor. Thus, the processor time space table 
does not project the effect of iteration number a processor should perform. We 
therefore calculate the timinp results in hand, based on the timings of the pro- 
cessor time space table. 

We choose a number of AIR3D routines and process as stated in Chapters 
1 and 3. The subroutines used in our experiment were XXM , YYM and ZZM 
for DOALL cases, FLUXVE and AMATRX for basic blocks, and several other 
routines are considered for DOACR loops. The parallel codes shown in this 
report are restructured automatically but for the DOACR routines. Thus, we 
do not evaluate the timings for the DOACR type routines. 

In the experiment, we defined several working environments, setting the 
number of iterations, the number of processors, and the communication cost. 
The number of processors is varied, using 4, 16, 32 processors. We put the syn- 
thesized parallel codes for the 4-processor system in the Appendices. We 
assume that the communication cost between any two processors will be the 
same (disregarding differences due to the computer network and the distance of 
the two communicating processors). We also vary the communication cost be 
1000 and 5000 unit times per word. 1 


'Communication cost of 1000 simulates a hypercube MIMD machines, such as 
Communication cost of 5000 are used for the comparison purpose. 


the iPSC Hypercube. 
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Parallelizing encounters overhead, such as synchronization, communica- 
tion overhead, and an extra code execution overhead. Thus the speedup is not 
always proportional to either the number of processors or the degree of parallel- 
ism. The speedup value is calculated from dividing the sequential execution 
time by the turn around time. 

The timing results and the speedup factors for XXM are shown in Table I 
and Table II. The results show that the parallel codes can achieve up to 28 in 
the 32- processor system. This implies that the transformed DOALL loops are 
highly parallelized, and as a result, the compiler can transform the DOALL 
loops effectively. The better speedup achievement is observed when the 
number of processors and the number of iterations are increased. However, 
increment of processors does not necessarily increase the speedup when the 
number of iteration is small. For example, when we increase the number of 
processors from 16 to 32 for 10 iterations, the speedup values are decreased, as 
indicated in Table I and II. It is also observed that the speedup values are 
decreased when the unit communication cost is increased, as expected. We 
observed the results for YYM and ZZM . The trends seem quite similar to 
XXM. 

The B-HIVE compiler can also improve speedup factors for basic blocks, 
such as FLUXVE, that has several assignment statements only. Normally it is 
not known to be parallelizable, but the flexible granularity model provides a 
way to obtain parallelism as we have described. The simulation results when 
the communication cost be 1000 and 5000 unit times, the speedup factors are 
1.1 and 1.0, respectively. The unit speedup at 5000 is expected, since the com- 
piler avoids excessive parallelizing automatically by assigning a whole routine 
to a single processor. The timing results are given in Table VII. 



Table I. The Total Times Needed for Execution of Subroutine XXM. 



Number of Iteration (Comm, cost = 1000) 

# of PE 

10 

100 

1000 

4 

11253 

69597 

666297 

16 

9365 

25277 

173789 

32 

11273 

19229 

93485 



Number of Iteration (Comm, cost = 5000) 

# of PE 

10 

100 

1000 

4 

16553 

74897 

671597 

16 

16822 

30082 

181246 

32 

21382 

26686 

103594 


Table II. Speedup Result for XXM 



Number of Iteration (Com>i. t ost = 1000) 

# of PE 

10 

100 

1000 

4 

2.40 

3.82 

3.98 

16 

2.88 

10.51 

15.26 

32 

2.39 

13.82 

28.37 



Number of Iteration (Comm, cost = 5000) 

#of PE 

10 

100 

1000 

4 

1.63 

3.55 

3.95 

16 

1.60 

8.83 

14.63 

32 

1.26 

9.96 

25.60 
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Table III. The Total Times Needed for Execution of Subroutine YYM. 



Number of Iteration (Comm, cost = 1000 ) 

# of PE 

10 

100 

1000 

4 

11827 

73867 

708367 

16 

9607 

26527 

184447 

32 

14167 

22627 

101587 



Number of Iteration ( Comm, cost = 5000) 

# of PE 

10 

100 

1000 

4 

17467 

79507 

714007 

16 

18067 

32167 

192907 

32 

22627 

28267 

110047 


Table IV. Speedup Result for YYM 



Number of Iteration (Comm, cost = 1000) 

# of PE 

10 

100 

1000 

4 

2.40 

3.80 

3.98 

16 

2.98 

10.65 

15.29 

32 

2.02 

12.48 

27.76 



Number of Iteration ( Comm, cost = 5000) 

# of PE 

10 

100 

1000 

4 

1.64 

3.55 

3.95 

16 

1.58 

8.78 

14.62 

32 

1.26 

9.99 

25.63 
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Table VI. The Total Times Needed for Execution of Subroutine ZZM . 



Number of Iteration (Comm, cost = 1000) 

# of PE 

10 

100 

1000 

4 

11557 

72057 

690807 

16 

9477 

25977 

179977 

32 

14037 

22287 

99287 



Number of Iteration (Comm, cost = 5000) 

# of PE 

10 

100 

1000 

4 

19807 

77557 

696307 

16 

17727 

27787 

107537 

32 

22287 

27787 

10753" 


Table VI. Speedup Result for ZZM 


Number of Iteration [Comm, cost = 1000) 


# of PE 

10 

100 

1000 

4 

2.41 

3.82 

3.98 

16 

2.94 

10.60 

15.28 

32 

1.99 

12.37 

27.70 



Number of Iteration (Comm, cost = 5000) 

# of PE 

10 

100 

1000 

4 

1.41 

3.55 

3.95 

16 

1.57 

8.75 

14.61 

32 

1.25 

9.91 

25.58 
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Table VII. The Total Times Needed for Execution of Subroutine FLUXVE . 


[Comm, cost = 1000) 

# of PE 

Total Time 

Speedup 

1 

3268 

1.0 

4 

2978 

1.1 


(Comm, cost = 5000) 

# of PE 

Total Time 

Speedup 

1 

3268 

1.0 

4 

3268 

1.0 




CHAPTER 5 


CONCLUSION 


An approach for parallelizing sequential program has been developed in 
this work. The vertical partitioning and appropriate scheduling is used in the 
B-HIVE parallelizing compilers. The performance of the parallelizing models is 
determined using several routines of AIR3D program package. Especially, the 
communication overhead is considered to evaluate the task allocation on a dis- 
tributed memory multiprocessor system. As seen in the parallel code synthesis 
examples, the B-HIVE compiler can transform the sequential codes into parallel 
version automatically. The speedup factor is quite close to the number of pro- 
cessors when the number of iterations is large, and the results seems superior to 
the results in [EBS84]. The current version of the compiler can restructure 
DOALL loops, basic blocks automatically, and will be updated to restructure 
the DOACR loops. 

Beyond parallelizing, the parallelizing compiler should consider a way to 
restructure branch instructions, especially the forward branch type codes, 
automatically. Upon completion of the compiler implementation, we are plan- 
ning to test the correctness of the compiler for several scientific packages, such 
as EISPACK and LINPACK. Actual test will also be conducted thereafter. 


1. Software Packages Developed 

So far, we have implemented several software packages that are used in 
the various phases of the B-HIVE compiler at North Carolina State University. 
Following is the list of software packages classified as per the compiling phases. 

(1) Phase 1 : FORTRAN Program Syntax Verifier: implementation com- 
pleted. 

(2) Phase 2 : Program Partitioned implementation completed. 

(3) Phase 3 : Basic Block Partitioned implementation completed. 

(4) Phase 4 : Infinite Processor Scheduler. 

a) Loop Type Checker has been implemented. 

b) Basic Block and DOALL Loop schedulers have been implemented. 

c) DOARC and DOSEQ Loop schedulers are under construction. 
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(5) Phase 5 : Finite Processor Allocator. 

a) Basic Block and DOALL Loop Allocator have been implemented. 

b) DOACR and DOSEQ Allocators are under construction. 

(6) Phase 6 : Code Synthesizer. It has been implemented. 

(7) Phase 7 : B-HIVE Coordinator / object code generator. It is under con- 
struction based on the communication primitives designed for DOALL 
loops and Basic blocks. 


2. Future Plan 

We will continue the implementation of the parallelizing compiler and 
make it operational on a real machine (B-HIVE). The necessary work to be 
done during ’88 and ’89 includes : 

(1) Completion of DOARC and DOSEQ loop scheduler associated with allo- 
cation strategy development. 

(2) Study on subroutine calls and argument passing strategies on a loosely 
coupled distributed memory multiprocessor environment. 

(3) Extensive evaluation of the compiler with testing. Testing will include 
restructuring of AIR3D package and other scientific packages. 


3. Suggestion and Comment 

The parallelizing compiler work is a time consuming project. It requires 
highly advanced techniques in various fields, such as parallel processing, com- 
piler construction, data structure implementation, and programmings, and etc. 
If additional funds are awarded, we could complete the implementation of the 
parallel compiler soon, and newer techniques could be developed. With the 
NASA’s support, we have no doubt that our compiler will be the first actual 
parallel compiler for the loosely coupled distributed memory multiprocessor 
environment. 
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APPENDICES 


A: Subroutine FLUXVE 

B: Subroutine XXM 

C: Subroutine YYM 


D: Subroutine ZZM 


APPENDIX A. SUBROUTINE 'FLUXVE' 
a) Sequential Code 


RR = 1 . /Q (KL, 1 / J) 

U = Q (KL, 2, J) *RR 
V = Q (KL, 3, J) *RR 
W = Q (KL, 4, J) *RR 
QS = R4+R1*U+R2*V+R3*W 

PP = GAMI* (Q(KL / 5 / J) -.5*Q(KL / 1, J) * (U*U+V*V+W*W) ) 
RJ = l./Q(KL,6,J) 

QSINFJ = (R4+R1*UINF+R2*VINF+R3*WINF) *RJ 

PINFJ = PINF*RJ 

FV ( 1 ) = Q(KL,1,J)*QS 

FV (2) = Q (KL/ 2, J) *QS+R1*PP 

FV (3) = Q (KL, 3, J) *QS+R2*PP 

FV (4 ) = Q (KL, 4, J) *QS+R3*PP 

FV (5 ) = (Q (KL, 5, J) +PP) *QS-R4*PP 

FV ( 1 ) = FV(1) -QSINFJ 

FV (2) = FV (2) -UINF*QSINFJ-R1*PINFJ 

FV (3) = FV (3) -VINF*QSINFJ-R2*PINFJ 

FV (4) = FV(4) -WINF*QSINFJ-R3*PINFJ 

FV (5) = FV (5) - (EINF+PINF) *QSINFJ+R4*PINFJ 

RETURN 

END 


9 


APPENDIX A (continue) 


b) Parallel Code 

FILE 1 {Processor 1} 

CALL SEND (J 4) 

CALL SEND (KL 4) 

CALL SEND (R1 • 4) 

CALL SEND (R2 3) 

CALL SEND (J 3) 

CALL SEND (KL 3) 

CALL SEND (R3 2) 

CALL SEND (J 2) 

CALL SEND (KL 2) 

RR = l./Q(KL,l, J) 

CALL SEND (Q 4) 

CALL SEND (Q 3) 

CALL SEND (Q 2) 

CALL SEND (UINF 4) 

CALL SEND (VINF 3) 

CALL SEND (WINF 2) 

U = Q(KL,2, J) *RR 
V = Q (KL, 3, J) *RR 
W = Q (KL, 4, J) *RR 
QS = R4+R1*U+R2*V+R3*W 

CALL SEND (QS 4) 

CALL SEND (QS 3) 

CALL SEND (QS 2) 

PP = GAMI* (Q(KL,5, J) -,5*Q(KL, 1, J) *(U*U+V*V+W*W) ) 
CALL SEND (PP 4) 

CALL SEND (PP 3) 

CALL SEND (PP 2) 

RJ = 1 . /Q (KL, 6, J) 

QSINFJ = (R4+R1*UINF+R2*VINF+R3*WINF) *RJ 
CALL SEND (QSINFJ 4) 

CALL SEND (QSINFJ 3) 

CALL SEND (QSINFJ 2) 

PINFJ = PINF*RJ 

CALL SEND (PINFJ 4) 

CALL SEND (PINFJ 3) 

CALL SEND (PINFJ 2) 

FV (5) = (Q (KL, 5, J) +PP) *QS-R4*PP 

FV (5) = FV (5) - (EINF+PINF) *QSINFJ+R4*PINFJ 

CALL RECV (FV (4 ) 2) 

CALL RECV ( F V ( 3 ) 3) 

CALL RECV (FV ( 1 ) 4) 

CALL RECV ( F V ( 2 ) 4) 

END 

9 


9 



FILE 


2 


{Processor 2} 


CALL 

RECV 

(PP 

1) 

CALL 

RECV 

(R3 

1) 

CALL 

RECV 

(QS 

1) 

CALL 

RECV 

(J 

1) 

CALL 

RECV 

(KL 

1) 

CALL 

RECV 

(Q 

1) 

FV (4) 

= Q (KL, 4, J) *QS+R3*PP 


CALL 

RECV 

(PINFJ 

1) 

CALL 

RECV 

(QSINFJ 

1) 

CALL 

RECV 

(WINF 

1) 

FV (4 ) 
END 

= FV 

(4) -WINF*QSINFJ-R3*PINFJ 


CALL 

SEND 

( F V ( 4 ) 

1) 


9 


9 



FILE 


3 


{processor 3} 


CALL RECV (PP 1) 

CALL RECV (R2 1) 

CALL RECV (QS 1) 

CALL RECV (J 1) 

CALL RECV (KL 1) 

CALL RECV (Q 1) 

FV (3) = Q <KL, 3, J) *QS+R2*PP 

CALL RECV (PINFJ 1) 

CALL RECV (QSINFJ 1) 

CALL RECV (VINF 1) 

FV (3) = FV (3) -VINF*QSINF J-R2 *PINF J 
CALL SEND (FV(3) 1) 

END 


9 


9 


FILE 


4 


{processor 4} 


CALL RECV (QS 1) 

CALL RECV (J 1) 

CALL RECV (KL 1) 

CALL RECV (Q 1) 

FV (1) = Q (KL, 1, J) *QS 

CALL RECV (PP 1) 

CALL RECV (Rl 1) 

FV (2) = Q (KL, 2, J) *QS+R1*PP 

CALL RECV (QSINFJ 1) 

FV ( 1 ) = FV ( 1 ) -QSINFJ 

CALL SEND (FV(1) 1) 

CALL RECV (PINFJ 1) 

CALL RECV (UINF 1) 


FV (2) = FV (2) -UINF*QSINFJ-R1*PINFJ 
CALL SEND (FV(2) 

END 


1 ) 



APPENDIX B. SUBROUTINE 'XXM' 


a) Sequential Code 


K = M 

DY2 = . 5/DY1 
DZ2 = . 5/DZ1 
KL = (L-l) *ND+K 
KP = KL+1 
KR = KL-1 
LP = KL+ND 
LR = KL-ND 
DO 10 J = J1,J2 
RJ = Q (KL, 6, J) 

IF( (K.NE.l) .AND. (K.NE.KMAX) ) THEN 

XK = (X (KP, J) -X (KR, J) ) *DY2 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 

ELSE IF(K.EQ.l) THEN 

K1 = KL+1 

K2 = KL+2 

XK = (-3.*X(KL, J) +4.*X(K1, J) -X(K2, J) ) *DY2 
YK = (-3 . *Y (KL, J) + 4 . *Y (K1,J) - Y (K2,J) ) *DY2 
ZK = (-3. *Z (KL, J) +4 .*Z (Kl, J) -Z (K2, J) ) *DY2 
ELSE 

Kl = KL-1 
K2 = KL-2 

XK = (3.*X(KL, J)-4.*X(K1,J)+X(K2, J))*DY2 
YK = (3.*Y(KL, J)-4.*Y(K1,J)+Y(K2, J))*DY2 
ZK = (3.*Z(KL, J)-4.*Z(K1, J)+Z(K2, J) ) *DY2 
END IF 

IF ( (L.NE. 1) .AND. (L.NE.LMAX) ) THEN 
XL = (X (LP, J) -X (LR, J) ) *DZ2 
YL = (Y (LP, J) -Y (LR, J) ) *DZ2 
ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 
ELSE IF(L.EQ.l) THEN 
LI = KL+ND 
L2 = KL+2*ND 

XL = (-3 . *X (KL, J)+4.*X(L1,J)-X(L2,J) ) *DZ2 
YL = (-3.*Y(KL, J)+4.*Y(L1,J)-Y(L2, J))*DZ2 
ZL = (-3.*Z(KL,J)+4.*Z(L1,J)-Z(L2,J))*DZ2 
ELSE 

LI = KL-ND 
L2 = KL-2*ND 

XL = (3.*X(KL, J)-4.*X(L1, J)+X(L2, J) ) *DZ2 
YL = (3 . *Y (KL, J) -4 . *Y (L1,J) +Y (L2,J) ) *DZ2 
ZL = (3.*Z (KL, J) -4 ,*Z (LI, J) +Z (L2, J) ) *DZ2 
END IF 

XX(J, 1) = (YK*ZL-ZK*YL) *RJ 
XX (J, 2) = (ZK*XL-XK*ZL) *RJ 
XX (J, 3) = (XK*YL-YK*XL) *RJ 

XX (J, 4) = -OMEGA* (Z (KL, J) *XX ( J, 2) -Y (KL, J) *XX (J, 3) ) 



10 CONTINUE 
RETURN 
END 


APPENDIX B (continue) 


b) . Parallel Code 

FILE 1 {Processor 1} 


CALL 

SEND 

( J2 

2) 

CALL 

SEND 

(J1 

2) 

CALL 

SEND 

( J2 

3) 

CALL 

SEND 

(J1 

3) 

CALL 

SEND 

(J2 

4) 

CALL 

SEND 

(J1 

4) 

CALL 

SEND 

(L 

2) 

CALL 

SEND 

(L 

3) 

CALL 

SEND 

(L 

4) 

K = M 



CALL 

SEND 

(KMAX 

2) 

CALL 

SEND 

(KMAX 

3) 

CALL 

SEND 

(KMAX 

4) 

CALL 

SEND 

(K 

2) 

CALL 

SEND 

(K 

3) 

CALL 

SEND 

(K 

4) 

CALL 

SEND 

(LMAX 

2) 

CALL 

SEND 

(LMAX 

3) 

CALL 

SEND 

(LMAX 

4) 

CALL 

SEND 

(ND 

2) 

CALL 

SEND 

(ND 

3) 

CALL 

SEND 

(ND 

4) 

CALL 

SEND 

(OMEGA 

2) 

CALL 

SEND 

(OMEGA 

3) 

CALL 

SEND 

(OMEGA 

4) 

DY2 = 

= . 5/DY1 


CALL 

SEND 

(DY2 

2) 

CALL 

SEND 

(DY2 

3) 

CALL 

SEND 

(DY2 

4) 

DZ2 = 

* . 5/DZ1 


CALL 

SEND 

(DZ2 

2) 

CALL 

SEND 

(DZ2 

3) 

CALL 

SEND 

(DZ2 

4) 

KL = 

(L-l) *ND+K 


CALL 

SEND 

(KL 

2) 

CALL 

SEND 

(KL 

3) 

CALL 

SEND 

(KL 

4) 

KP = 

KL+1 



CALL 

SEND 

(KP 

2) 

CALL 

SEND 

(KP 

3) 

CALL 

SEND 

(KP 

4) 

KR = 

KL-1 



CALL 

SEND 

(KR 

2) 

CALL 

SEND 

(KR 

3) 

CALL 

SEND 

(KR 

4) 

LP = 

KL+ND 






CALL 

SEND 

(LP 

2) 

CALL 

SEND 

(LP 

3) 

CALL 

SEND 

(LP 

4) 

LR = 

KL-ND 



CALL 

SEND 

(LR 

2) 

CALL 

SEND 

(LR 

3) 

CALL 

SEND 

(LR 

4) 


NQQ0=(J2- (Jl) +1-0) DIV 4+1 
LQQ0=J1 

MQQ0=MIN (LQQ0+0+1 *NQQQ-1 , J2) 

LZZ=Jl+0+NQQ0 

MZZ=MIN (LZZ+0+2*NQQ0-l, J2) 

CALL SEND (Q (KL, *,KZZ) KZZ=LZZ, MZZ, 2 ) 

LZZ=Jl+0+2*NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l / J2) 

CALL SEND (Q (KL, * , KZZ) KZZ=LZZ,MZZ, 3) 

LZZ=Jl+0+3*NQQ0 

MZZ=MIN (LZZ+0+4 *NQQ0-1, J2) 

CALL SEND (Q (KL, *, KZZ) KZZ=LZZ, MZZ, 4) 

LZZ=Jl+0+NQQ0 

MZZ=MIN (LZZ+0+2*NQQ0-l, J2) 

CALL SEND (X (*, KZZ) KZZ=LZZ,MZZ, 2) 

LZZ=Jl+0+2*NQQ0 

MZZ=MIN <LZZ+0+3*NQQ0-l, J2) 

CALL SEND (X ( * , KZZ) KZZ=LZZ, MZZ, 3) 

LZZ=Jl+0+3*NQQ0 

MZZ=MIN (LZZ+0+4 *NQQ0-1 , J2) 

CALL SEND (X (*, KZZ) KZZ=LZZ,MZZ, 4) 

LZZ=Jl+0+NQQ0 

MZZ=MIN (LZZ + 0+2*NQQ0-l, J2) 

CALL SEND ( Y (* , KZZ) KZZ=LZZ, MZZ, 2 ) 

LZZ=Jl+0+2*NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l, J2) 

CALL SEND (Y ( * , KZZ) KZZ=LZZ , MZZ, 3) 
LZZ=Jl+0+3*NQQ0 
MZZ=MIN(LZZ+0+4*NQQ0-l, J2) 

CALL SEND (Y (*,KZZ) KZZ=LZZ,MZZ, 4) 

LZZ=Jl+0+NQQ0 

MZZ=MIN (LZZ+0+2*NQQ0-l, J2) 

CALL SEND (Z ( * , KZZ) KZZ=LZZ, MZZ, 2) 

LZZ=Jl+0+2*NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l , J2) 

CALL SEND (Z ( * , KZZ ) KZZ=LZZ, MZZ, 3) 

LZZ=Jl+0+3*NQQ0 

MZZ=MIN (LZZ+0+4 *NQQ0~1 , J2) 

CALL SEND (Z (*,KZZ) KZZ=LZZ,MZZ, 4) 

DO 10 J=LQQ0,MQQ0 
RJ = Q (KL, 6, J) 

IF ( (K.NE.l) .AND. (K.NE.KMAX) ) THEN 
XK = (X (KP, J) -X (KR, J) ) *DY2 
YK = (Y (KP, J) -Y (KR, J) ) *DY2 
ZK = (Z (KP, J) -Z (KR, J) ) *DY2 
ELSE IF(K.EQ.l) THEN 


K1 = KL+1 
K2 = KL+2 

XK = <-3.*X(KL, J)+4.*X(K1, J)-X(K2, J) )*DY2 
YK = (-3.*Y (KL, J) +4 . *Y (Kl , J) - Y (K2,J) ) *DY2 
ZK = (-3.*Z (KL, J) + 4.*Z (Kl, J) -Z (K2, J) ) *DY2 
ELSE 

K1 = KL-1 
K2 = KL-2 

XK = (3 . *X (KL, J) -4 . *X (Kl , J) +X (K2 , J) ) *DY2 
YK = (3 . *Y (KL, J) - 4 . *Y (K1 , J) + Y (K2 , J) ) *DY2 
ZK = (3.*Z (KL, J) -4.*Z (Kl, J) +Z (K2, J) ) *DY2 
END IF 

IF ( (L.NE.l) .AND. (L.NE.LMAX) ) THEN 
XL = (X (LP, J) -X (LR, J) ) *DZ2 
YL = (Y (LP, J) -Y (LR, J) ) *DZ2 
ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 
ELSE IF(L.EQ.l) THEN 
LI = KL+ND 
L2 = KL+2 *ND 

XL = (-3 . *X (KL, J) +4 . *X (LI, J) -X (L2, J) ) *DZ2 
YL = (-3 . *Y (KL, J) +4 . *Y (LI, J) -Y (L2, J) ) *DZ2 
ZL = (-3.*Z(KL, J)+4.*Z(L1, J)-Z(L2, J) ) *DZ2 
ELSE 

LI = KL-ND 
L2 = KL-2*ND 

XL = (3 . *X (KL, J) - 4 . *X (LI, J) +X (L2, J) ) *DZ2 
YL = (3 . *Y (KL, J) -4 . *Y (L1,J) +Y (L2,J) ) *DZ2 
ZL = (3.*Z(KL,J)-4.*Z(L1,J)+Z(L2,J))*DZ2 
END IF 

XX (J, 1) = (YK*ZL-ZK*YL) *RJ 
XX (J, 2) = (ZK*XL-XK*ZL) *RJ 
XX (J, 3) = (XK*YL-YK*XL) *RJ 

XX (J, 4) = -OMEGA* (Z (KL, J) *XX ( J, 2 ) -Y (KL, J) *XX ( J, 3) ) 
10 CONTINUE 


CALL 

RECV 

(J 

1) 

CALL 

RECV 

(RJ 

1) 

CALL 

RECV 

(XK 

. 1) 

CALL 

RECV 

(YK 

1) 

CALL 

RECV 

(ZK 

1) 

CALL 

RECV 

(Kl 

1) 

CALL 

RECV 

(K2 

1) 

CALL 

RECV 

(XL 

1) 

CALL 

RECV 

(YL 

1) 

CALL 

RECV 

(ZL 

1) 

CALL 

RECV 

(LI 

1) 

CALL 

RECV 

(L2 

1) 


KQQ=J2 

LZZ=Jl+0+NQQ0 
MZZ=MIN(LZZ+0+2*NQQ0-l, J2) 

CALL RECV (XX (KZZ, *) KZZ=LZZ,MZZ, 2) 

LZZ=Jl + 0+2 *NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l, J2) 



CALL RECV (XX (KZZ, * ) KZZ=LZZ , MZZ, 3) 

LZZ=Jl+0+3*NQQ0 

MZZ=MIN (LZZ+0 + 4 *NQQ0~1 , J2 ) 

CALL RECV (XX(KZZ, *)KZZ=LZZ / MZZ,4) 
END 



FILE 


2 


{Processor 2} 


C 


C 


C 


C 


C 


CALL RECV (J2 1) 

CALL RECV (J1 1) 

NQQ0=(J2- (Jl) +1-0) DIV 4+1 
LQQ0=Jl+0+NQQ0 
MQQ0=MIN (LQQ0+0+2*NQQ0-l, J2) 

DO 10 J=LQQ0/MQQ0 
IF (J .EQ. LQQO) THEN 

CALL RECV (KL 1) 


CALL RECV (Q (KL, * , KZZ ) KZZ=LQQ0 , MQQO , 1 ) 


END IF 

RJ = Q (KL, 6, J) 

IF (J .EQ. LQQO) THEN 

CALL RECV (KMAX 1) 

CALL RECV (K 1) 


END IF 

IF ( (K.NE. 1) .AND. (K.NE.KMAX) ) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (DY2 1) 

CALL RECV (KR 1) 


CALL RECV (KP 1) 

CALL RECV (X(*, KZZ )KZZ=LQQ0, MQQO, 1) 

END IF 

XK = (X (KP, J) -X (KR, J) ) *DY2 
IF (J .EQ. LQQO) THEN 


CALL RECV(Y (*, KZZ) KZZ=LQQ0, MQQO, 1) 
END IF 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 
IF (J .EQ. LQQO) THEN 


CALL RECV ( Z (*, KZZ) KZZ=LQQ0, MQQO, 1) 

END IF 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 
ELSE IF(K.EQ.l) THEN 
K1 = KL+1 
K2 = KL+2 

XK = (-3.*X(KL, J)+4.*X(K1, J)-X(K2, J))*DY2 
YK = (-3 . *Y (KL,J) + 4 . *Y (K1,J) -Y (K2,J) ) *DY2 
ZK = (-3.*Z (KL, J)+4.*Z (Kl, J) -Z (K2, J) ) *DY2 
ELSE 

Kl = KL-1 
K2 = KL-2 

XK = (3.*X(KL, J) -4.*X(K1, J) +X(K2, J) ) *DY2 
YK = (3. *Y (KL, J) -4 .*Y(K1, J) +Y(K2, J) ) *DY2 
ZK = (3 . *Z (KL, J) -4 . *Z (Kl, J) +Z (K2 , J) ) *DY2 
END IF 

IF (J .EQ. LQQO) THEN 
CALL RECV (LMAX 


1 ) 



c 


c 


CALL RECV (L 1) 


END IF 

IF ( (L.NE.l) .AND. (L.NE.LMAX) ) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (DZ2 1) 

CALL RECV (LR 1) 

CALL RECV (LP 1) 


ENDIF 

XL = (X (LP, J) -X (LR, J) ) *DZ2 
YL = (Y (LP, J) -Y {LR, J) ) *DZ2 
ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 
ELSE IF(L.EQ.l) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (ND 1) 


ENDIF 

LI = KL+ND 
L2 = KL+2*ND 

XL = (-3 . *X (KL, J) +4 . *X (LI, J) -X (L2, J) ) *DZ2 
YL = (-3 . *Y (KL, J)+4.*Y(L1,J)-Y(L2,J) ) *DZ2 
ZL = (-3.*Z(KL, J)+4.*Z(L1,J)-Z(L2,J))*DZ2 
ELSE 

LI = KL-ND 
L2 = KL-2 *ND 

XL = (3 . *X (KL, J) -4.*X(L1,J) +X (L2 , J). ) *DZ2 
YL = (3.*Y(KL, J)-4.*Y(L1, J)+Y(L2, J))*DZ2 
ZL = (3.*Z(KL, J)-4.*Z(L1, J) +Z (L2 , J). ) *DZ2 
ENDIF 

XX (J, 1) = ( YK*ZL-ZK*YL) *RJ 
XX (J, 2) = (ZK*XL-XK*ZL) *RJ 
XX (J, 3) = (XK*YL-YK*XL) *RJ 
IF (J .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 


ENDIF 
XX (J, 4) = 
10 CONTINUE 
KQQ=J2 
IF (J .EQ 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 


-OMEGA* (Z (KL, J) *XX ( J, 2) -Y (KL, J) *XX ( J, 3) ) 


KQQ) THEN 


(J 

1) 

(RJ 

1) 

(XK 

1) 

(YK 

1) 

(ZK 

1) 

(K1 

1) 

(K2 

1) 

(XL 

1) 

(YL 

1) 

(ZL 

1) 

(LI 

1) 

(L2 

1) 



END IF 


CALL SEND (XX (KZZ, *) KZZ=LQQO , MQQO, 1) 
END 


FILE 


3 


{Processor 3} 


1 ) 

1 ) 


C 


C 


C 


C 


C 


CALL RECV (J2 
CALL RECV (J1 

NQQ0= ( J2- ( Jl) +1-0) DIV 4+1 
LQQ0=Jl+0+2*NQQ0 
MQQ0=MIN(LQQ0+0+3*NQQ0-l, J2) 

DO 10 J=LQQ0,MQQ0 
IF (J .EQ. LQQO) THEN 
CALL RECV (KL 1) 


CALL RECV (Q (KL, * , KZZ ) KZZ=LQQ0 , MQQO , 1) 


END IF 

RJ = Q (KL, 6, J) 

IF (J .EQ. LQQO) THEN 

CALL RECV (KMAX 1) 

CALL RECV (K 1) 


END IF 

IF ( (K.NE. 1) .AND. (K.NE.KMAX) ) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (DY2 1) 

CALL RECV (KR 1) 


CALL RECV (KP 1) 


CALL RECV (X (*, KZZ) KZZ=LQQ0, MQQO, 1) 
END IF 

XK = (X (KP, J) -X (KR, J) ) *DY2 
IF (J .EQ. LQQO) THEN 


CALL RECV ( Y (*, KZZ) KZZ=LQQ0 , MQQO, 1 ) 
END IF 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 
IF (J .EQ. LQQO) THEN 


CALL RECV (Z (*, KZZ) KZZ=LQQ0 , MQQO, 1) 

END IF 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 
ELSE IF(K.EQ.l) THEN 
K1 = KL+1 
K2 = KL+2 

XK = (-3 . *X (KL, J)+4.*X(K1,J) -X(K2,J) ) *DY2 
YK = (-3.*Y(KL, J)+4.*Y(K1, J)-Y(K2, J) )*DY2 
ZK = (-3. *Z (KL, J) +4 . *Z (Kl, J) -Z (K2, J) ) *DY2 
ELSE 

Kl = KL-1 
K2 = KL-2 

XK = (3. *X(KL, J) -4 . *X (Kl, J) +X (K2, J) ) *DY2 
YK = (3.*Y(KL, J) -4.*Y(K1, J) +Y(K2, J) ) *DY2 
ZK = (3.*Z (KL, J) -4.*Z (Kl, J) +Z (K2, J) ) *DY2 
END IF 

IF (J .EQ. LQQO) THEN 
CALL RECV (LMAX 


1 ) 



c 


CALL RECV (L 


1 ) 


END IF 

IF ( (L.NE.l) .AND. (L.NE.LMAX) ) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (DZ2 . 1) 

CALL RECV (LR 1) 

CALL RECV (LP 1) 


END IF 

XL = (X (LP, J) -X (LR, J) ) *DZ2 
YL = (Y (LP/ J) -Y (LR, J) ) *DZ2 
ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 
ELSE IF(L.EQ.l) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (ND 1) 


END IF 

LI = KL+ND 
L2 = KL+2*ND 

XL = (-3.*X(KL, J) +4.*X(L1, J) -X(L2, J) ) *DZ2 
YL = (-3 . *Y (KL, J) + 4 . *Y (LI, J) -Y (L2, J) ) *DZ2 
ZL = (-3.*Z(KL,J)+4.*Z(L1,J)-Z(L2,J))*DZ2 
ELSE 

LI = KL-ND 
L2 = KL-2*ND 

XL = (3.*X(KL,J)-4.*X(L1,J)+X(L2,J))*DZ2 
YL = (3 . *Y (KL, J) -4 . *Y (LI , J) +Y (L2 , J) ) *DZ2 
ZL = (3 . *Z (KL, J)-4.*Z(L1,J)+Z(L2,J))*DZ2 
END IF 

XX (J, 1) = (YK*ZL-ZK*YL) *RJ 
XX (J, 2) = (ZK*XL-XK*ZL) *RJ 
XX (J, 3) = (XK*YL-YK*XL) *RJ 
IF (J .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 


END IF 

XX ( J/ 4 ) = -OMEGA* (Z (KL, J) *XX ( J, 2) -Y (KL, J) *XX ( J, 3) ) 

10 CONTINUE 


KQQ=J2 
IF (J .EQ. 
CALL SEND 

KQQ) THEN 
(J 

1) 

CALL 

SEND 

(RJ 

1) 

CALL 

SEND 

(XK 

1) 

CALL 

SEND 

(YK 

1) 

CALL 

SEND 

(ZK 

1) 

CALL 

SEND 

(K1 

1) 

CALL 

SEND 

(K2 

1) 

CALL 

SEND 

(XL 

1) 

CALL 

SEND 

(YL 

1) 

CALL 

SEND 

(ZL 

1) 

CALL 

SEND 

(LI 

1) 

CALL 

SEND 

(L2 

1) 



END IF 


CALL SEND (XX (KZZ, * ) KZZ=LQQO , MQQO , 1) 
END 



FILE 


4 


{Processor 4} 


C 


C 


C 


C 


C 


CALL RECV ( J2 1) 

CALL RECV (J1 1) 

NQQ0= ( J2- ( Jl) +1-0) DIV 4+1 
LQQ0=Jl+0+3*NQQ0 
MQQ0=J2 

DO 10 J=LQQ0, MQQO 
IF (J .EQ. LQQO) THEN 

CALL RECV (KL 1) 


CALL RECV (Q (KL, *, KZZ) KZZ=LQQ0, MQQO, 1) 


END IF 

RJ = Q (KL, 6, J) 

IF (J .EQ. LQQO) THEN 

CALL RECV (KMAX 1) 

CALL RECV <K 1) 


ENDIF 

IF ( (K.NE.l) .AND. (K.NE.KMAX) ) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (DY2 1) 

CALL RECV <KR 1) 


CALL RECV (KP 1) 


CALL RECV (X ( * , KZZ) KZZ=LQQ0 , MQQO , 1 ) ' 
ENDIF 

XK = (X (KP, J) -X (KR, J) ) *DY2 
IF (J .EQ. LQQO) THEN 


CALL RECV (Y (*, KZZ) KZZ=LQQ0, MQQO, 1) 
ENDIF 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 
IF (J .EQ. LQQO) THEN 


CALL RECV(Z (*, KZZ) KZZ=LQQ0, MQQO, 1) 

ENDIF 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 
ELSE IF(K.EQ.l) THEN 
K1 = KL+1 
K2 = KL+2 

XK = (-3 . *X (KL, J) +4 . *X (Kl, J) -X (K2, J) ) *DY2 
YK = (-3.*Y (KL, J) +4. *Y (Kl, J) -Y (K2, J) ) *DY2 
ZK = (-3.*Z (KL, J) +4.*Z (Kl, J) -Z (K2, J) ) *DY2 
ELSE 

Kl = KL-1 
K2 = KL-2 

XK = (3 . *X (KL, J) -4 . *X (Kl, J) +X (K2, J) ) *DY2 
YK = (3 . *Y (KL, J) -4 . *Y (Kl, J) +Y (K2, J) ) *DY2 
ZK = (3 . *Z (KL, J)-4.*Z(K1,J)+Z (K2, J) ) *DY2 
ENDIF 

IF (J .EQ. LQQO) THEN 
CALL RECV (LMAX 


1 ) 



c 


CALL RECV (L 1) 


ENDIF 

IF ( (L.NE.l) .AND. (L.NE.LMAX) ) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (DZ2 1) 

CALL RECV (LR 1) 

CALL RECV (LP 1) 


ENDIF 

XL = (X (LP, J) -X (LR, J) ) *DZ2 
YL = (Y(LP, J)-Y(LR, J) ) *DZ2 
ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 
ELSE IF(L.EQ.l) THEN 
IF (J .EQ. LQQO) THEN 

CALL RECV (ND 1) 


ENDIF 

LI = KL+ND 
L2 = KL+2*ND 

XL = (-3.*X(KL,J)+4.*X(L1,J)-X(L2,J))*DZ2 
YL = (-3 . *Y (KL, J) +4 . *Y (LI , J) - Y (L2,J) ) *DZ2 
ZL = (-3.*Z(KL, J)+4.*Z(L1, J) -Z(L2, J) )*DZ2 
ELSE 

LI = KL-ND 
L2 = KL-2*ND 

XL = (3. *X(KL, J) -4. *X (LI, J) +X (L2, J) ) *DZ2 
YL = (3.*Y (KL, J) -4. *Y (LI, J) +Y (L2, J) ) *DZ2 
ZL = (3.*Z(KL, J)-4.*Z (LI, J)+Z (L2, J) ) *DZ2 
ENDIF 

XX(J,1) = ( YK*ZL-ZK*YL) *RJ 
XX (J, 2) = (ZK*XL-XK*ZL) *RJ 
XX (J, 3) = (XK*YL-YK*XL) *RJ 
IF (J .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 


ENDIF 

XX (J, 4) = -OMEGA* (Z (KL, J) *XX(J, 2) -Y (KL, J) *XX(J, 3) ) 
10 CONTINUE 
KQQ=J2 

IF (J .EQ. KQQ) THEN 


CALL SEND (J 1) 
CALL SEND (RJ 1) 
CALL SEND (XK 1) 
CALL SEND (YK 1) 
CALL SEND (ZK 1) 
CALL SEND (K1 1) 
CALL SEND (K2 1) 
CALL SEND (XL 1) 
CALL SEND (YL 1) 
CALL SEND (ZL 1) 
CALL SEND (LI 1) 
CALL SEND (L2 1) 


END IF 


CALL SEND (XX (KZZ , * ) KZZ=LQQO , MQQO , 1 ) 
END 



APPENDIX C. SUBROUTINE ' YYM' 
a) . Sequential Codes 


THEN 


DX2 = . 5/DX1 
DZ2 = . 5/DZ1 
JP = J+l 
JR = J-l 
DO 10 K = K1,K2 
KL = (L-l) *ND+K 
LP = KL+ND 
LR = KL-ND 
RJ = Q (KL/ 6, J) 

IF ( (J.NE. 1) .AND. (J.EQ. JMAX) ) 

XJ = (X(KL, JP) -X(KL / JR) ) *DX2 
YJ = <Y(KL, JP) -Y(KL, JR) ) *DX2 
ZJ = (Z (KL, JP) ~Z (KL, JR) ) *DX2 
ELSE IF (J.EQ. 1) THEN 
J1 = J+l 
j2 = J+2 

XJ = (-3 . *X (KL, J) +4 . *X (KL, Jl) -X (KL, J2) ) *DX2 
YJ = (- 3 . *Y (KL, J) +4 . *Y (KL, Jl) -Y (KL, J2) ) *DX2 

ZJ = (~3.*Z(KL,J)+4.*Z(KL,J1)-Z(KL,J2))*DX2 

ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 

XJ = *(3%X (KL, J) -4 .*X(KL, Jl) +X(KL, J2) ) *DX2 
YJ = (3.*Y (KL, J) -4 .*Y (KL, Jl) +Y (KL, J2) ) *DX2 
ZJ = (3.*Z (KL, J) -4.*Z (KL, Jl) +Z (KL, J2) ) *DX2 
END IF 

IF ( (L.NE.l) .AND. (L.EQ.LMAX) ) THEN 
XL = (X (LP, J) -X (LR, J) ) *DZ2 
YL = (Y (LP, J) ~Y (LR, J) ) *DZ2 
ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 
ELSE IF(L.EQ.l) THEN 
LI = KL+ND 

L2 = KL+2*ND 

XL = (-3 . *X (KL, J) +4.*X(L1,J) -X(L2,J) ) *DZ2 
YL = (-3.*Y(KL, J)+4.*Y (LI, J) -Y(L2, J) ) *DZ2 

ZL = (-3.*Z(KL, J)+4.*Z (LI, J) -Z(L2, J) ) *DZ2 

ELSE IF (L.EQ.LMAX) THEN 
LI = KL-ND 

L2 = KL-2*ND 

XL = (3.*X(KL, J) -4.*X(L1, J) +X(L2, J) ) *DZ2 

YL = (3 . *Y (KL, J) -4 . * Y (L1,J)+Y (L2,J) ) *DZ2 

ZL = (3.*Z(KL,J)-4.*Z(L1,J)+Z(L2,J))*DZ2 


= (3. 
ENDIF 
YY (K, 1 ) = 
YY (K, 2) = 
YY (K, 3) = 
YY (K, 4 ) = 
10 CONTINUE 


(Z J*YL-YJ*ZL) *RJ 
(XJ*ZL-XL*ZJ) *RJ 
( Y J*XL-XJ*YL) *RJ 

-OMEGA* (Z (KL, J) *YY (K, 2) -Y (KL, J) *YY (K, 3) ) 


RETURN 

END 


9 


9 



APPENDIX C (continue) 
b) . Parallel Code 
FILE 1 {processor 1} 


CALL 

SEND 

(K2 

2) 

CALL 

SEND 

(K1 

2) 

CALL 

SEND 

(K2 

3) 

CALL 

SEND 

(K1 

3) 

CALL 

SEND 

(K2 

4) 

CALL 

SEND 

(K1 

4) 

CALL 

SEND 

(L 

2) 

CALL 

SEND 

(L 

3) 

CALL 

SEND 

(L 

4) 

CALL 

SEND 

(J 

2) 

CALL 

SEND 

(J 

3) 

CALL 

SEND 

(J 

4) 

DX2 = 

= . 5/DX1 


CALL 

SEND 

(ND 

2) 

CALL 

SEND 

(ND 

3) 

CALL 

SEND 

(ND 

4) 

CALL 

SEND 

( JMAX 

2) 

CALL 

SEND 

( JMAX 

3) 

CALL 

SEND 

(JMAX 

4) 

CALL 

SEND 

(DX2 

2) 

CALL 

SEND 

(DX2 

3) 

CALL 

SEND 

(DX2 

4) 

CALL 

SEND 

(LMAX 

2) 

CALL 

SEND 

(LMAX 

3) 

CALL 

SEND 

(LMAX 

4) 

CALL 

SEND 

(OMEGA 

2) 

CALL 

SEND 

(OMEGA 

3) 

CALL 

SEND 

(OMEGA 

4) 

DZ2 = 

= . 5/DZ1 


CALL 

SEND 

(DZ2 

2) 

CALL 

SEND 

(DZ2 

3) 

CALL 

SEND 

(DZ2 

4) 

JP = 

J+l 



CALL 

SEND 

(JP 

2) 

CALL 

SEND 

(JP 

3) 

CALL 

SEND 

(JP 

4) 

JR = 

J-l 



CALL 

SEND 

(JR 

2) 

CALL 

SEND 

(JR 

3) 

CALL 

SEND 

(JR 

4) 

NQQ0= 

= (K2- 

(Kl) +1-0) DIV 4+1 



LQQ0=K1 

MQQ0=MIN(LQQ0+0+l*NQQ0-l,K2) 

LZZ=Kl+0+NQQ0 

MZZ=MIN(LZZ+0+2*NQQ0-l,K2) 

CALL SEND (Q (*, *, J) KZZ=LZZ,MZZ, 2) 
LZZ=Kl+0+2*NQQ0 



MZZ=MIN(LZZ+0+3*NQQ0-l,K2) 

CALL SEND (Q ( *, *, J) KZZ=LZZ, MZZ, 3) 

LZZ=Kl+0+3*NQQ0 

MZZ=MIN (LZZ+0+4*NQQ0-l,K2) 

CALL SEND(Q(*,*, J) KZZ=LZZ, MZZ, 4 ) 

LZZ=Kl+0+NQQ0 

MZZ=MIN(LZZ+0+2*NQQ0-l,K2) 

CALL SEND (X (*, *) KZZ=LZZ,MZZ, 2) 

LZZ=Kl+0+2*NQQ0 

MZZ=MIN(LZZ+0+3*NQQ0-l,K2) 

CALL SEND (X (*, *) KZZ=LZZ,MZZ, 3) 

LZZ=Kl+0+3*NQQ0 

MZZ=MIN(LZZ+0+4*NQQ0-l,K2) 

CALL SEND (X (*, *) KZZ=LZZ, MZZ, 4) 

LZZ=Kl+0+NQQ0 

MZZ=MIN(LZZ+0+2*NQQ0-l,K2) 

CALL SEND (Y (*, *) KZZ=LZZ, MZZ, 2) 

LZZ=Kl+0+2*NQQ0 

MZZ=MIN(LZZ+0+3*NQQ0-l,K2) 

CALL SEND (Y (*, *) KZZ=LZZ,MZZ, 3) 

LZZ=Kl+0+3*NQQ0 

MZZ=MIN(LZZ+0+4*NQQ0-l,K2) 

CALL SEND (Y ( * , * ) KZZ=LZZ, MZZ, 4) 

LZZ=Kl+0+NQQ0 

MZZ=MIN(LZZ+0+2*NQQ0-l / K2) 

CALL SEND (Z (*, * ) KZZ=LZZ , MZZ , 2 ) 

LZZ=Kl+0+2*NQQ0 

MZZ=MIN(LZZ+0+3*NQQ0-l / K2) 

CALL SEND (Z (*, *) KZZ=LZZ,MZZ, 3) 

LZZ=Kl+0+3*NQQ0 

MZZ=MIN(LZZ+0+4*NQQ0-l / K2) 

CALL SEND (Z (*, * ) KZZ=LZZ, MZZ , 4 ) 

DO 10 K=LQQ0,MQQ0 
KL = (L-1)*ND+K 
LP = KL+ND 
LR = KL-ND 
RJ = Q (KL, 6, J) 

IF { (J.NE.l) .AND. (J.EQ. JMAX) ) THEN 
XJ = (X (KL, JP) -X (KL, JR) ) *DX2 
YJ = (Y (KL, JP) -Y (KL, JR) ) *DX2 
ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF (J.EQ. 1) THEN 
J1 = J+l 
J2 = J+2 

XJ = (-3 . *X (KL, J) +4 . *X (KL, Jl) -X (KL, J2) ) *DX2 
YJ = (-3 . *Y (KL, J) +4 . *Y (KL, Jl) -Y (KL, J2) ) *DX2 
ZJ = (-3 . *Z (KL, J) +4 . *Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 
J2 = J-2 

XJ = (3 . *X (KL, J) -4 . *X (KL, Jl) +X (KL, J2) ) *DX2 
YJ = (3 . *Y (KL, J) -4 . *Y (KL, Jl) +Y (KL, J2) ) *DX2 
ZJ = (3 . *Z (KL, J) -4 . *Z (KL, Jl) +Z (KL, J2) ) *DX2 


END IF 

IF ( (L.NE.l) .AND. (L.EQ.LMAX) ) THEN 

XL = (X (LP/ J) -X (LR, J) ) *DZ2 

YL = (Y(LP,J)-Y(LR,J))*DZ2 

ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 

ELSE IF(L.EQ.l) THEN 

LI = KL+ND 

L2 = KL+2*ND 

XL = (-3 . *X (KL, J) +4 . *X (LI, J) -X (L2, J) ) *DZ2 
YL = (-3 . *Y (KL, J) +4 . *Y (LI, J) -Y (L2, J) ) *DZ2 
ZL = (-3.*Z(KL,J)+4.*Z(L1,J)-Z(L2,J))*DZ2 
ELSE IF (L.EQ.LMAX) THEN 
LI = KL-ND 
L2 = KL-2*ND 

XL = (3.*X(KL, J) -4.*X(L1, J) +X(L2, J) ) *DZ2 
YL = (3 . *Y (KL, J)-4.*Y(L1,J)+Y(L2,J))*DZ2 
ZL = (3.*Z(KL,J)-4.*Z(L1,J)+Z(L2,J))*DZ2 
END IF 

YY (K, 1 ) = (ZJ*YL-YJ*ZL) *RJ ' 

YY (K, 2) = (XJ*ZL-XL*Z J) *RJ 
YY (K, 3) = ( YJ*XL-XJ* YL) *RJ 

YY (K, 4 ) = -OMEGA* (Z (KL, J) *YY(K,2) -Y(KL, J) *YY(K,3) ) 
10 CONTINUE 


CALL 

RECV 

(K 

1) 

CALL 

RECV 

(KL 

1) 

CALL 

RECV 

(LP 

1) 

CALL 

RECV 

(LR 

1) 

CALL 

RECV 

(RJ 

1) 

CALL 

RECV 

(XJ 

1) 

CALL 

RECV 

(YJ 

1) 

CALL 

RECV 

(ZJ 

1) 

CALL 

RECV 

(J1 

1) 

CALL 

RECV 

(J2 

1) 

CALL 

RECV 

(XL 

1) 

CALL 

RECV 

(YL 

1) 

CALL 

RECV 

(ZL 

1) 

CALL- 

RECV 

(LI 

1) 

CALL 

RECV 

(L2 

1) 


KQQ=K2 

LZZ=Kl+0+NQQ0 

MZZ=MIN (LZZ+0+2*NQQ0-l,K2) 

CALL RECV (YY (KZZ, *) KZZ=LZZ,MZZ, 2) 

LZZ=Kl+0+2*NQQ0 

MZZ=MIN(LZZ+0+3*NQQ0-l,K2) 

CALL RECV (YY (KZZ, *) KZZ=LZZ,MZZ, 3) 

LZZ=Kl+0+3*NQQ0 

MZZ=MIN (LZZ+0+4*NQQ0-l,K2) 

CALL RECV (YY (KZZ, *) KZZ=LZZ,MZZ, 4) 
END 



FILE 


2 


{Processor 2} 


CALL RECV (K2 
CALL RECV <K1 

NQQ0= (K2- (Kl) +1-0) DIV4+1 
LQQ0=Kl+0+NQQ0 
MQQ0=MIN(LQQ0+0+2*NQQ0-l,K2) 

DO 10 K=LQQ0, MQQO 
IF (K .EQ. LQQO) THEN 
CALL RECV (ND 
CALL RECV (L 

C 

ENDIF 

KL = ( L— 1 ) *ND+K 
LP = KL+ND 
LR = KL-ND 

IF (K .EQ. LQQO) THEN 
CALL RECV (J 1) 

C 

CALL RECV (Q ( * , * , J) KZZ=LQQ0 , MQQO , 1 ) 

ENDIF 

RJ = Q (KL, 6, J) 

IF (K .EQ. LQQO) THEN 

CALL RECV ( JMAX 1) 

C 

ENDIF 

IF ( (J.NE.l) .AND. (J.EQ. JMAX) ) THEN 
IF (K .EQ. LQQO) THEN 

CALL RECV (DX2 1) 

CALL RECV (JR 1) 

C 

CALL RECV (JP 1) 

CALL RECV (X ( * , * ) KZZ=LQQ0 , MQQO , 1 ) 

ENDIF 

XJ = (X (KL, JP) -X (KL, JR) ) *DX2 
IF (K .EQ. LQQO) THEN 

C 

CALL RECV (Y (*,*) KZZ=LQQ0, MQQO, 1) 

ENDIF 

YJ = (Y(KL, JP) -Y(KL / JR) ) *DX2 
IF (K .EQ. LQQO) THEN 

c 

CALL RECV (Z (*, *)KZZ=LQQ0, MQQO, 1) 

ENDIF 

ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF (J.EQ. 1) THEN 
J1 = J+l 
J2 = J+2 

XJ = (-3 . *X (KL, J) +4 . *X (KL, Jl) -X (KL, J2) ) *DX2 
YJ = (-3 . * Y (KL, J) +4 . * Y (KL, Jl) -Y (KL, J2) ) *DX2 
ZJ = (-3.*Z (KL, J) +4 .*Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 


1 ) 

1 ) 


1 ) 

1 ) 



J2 = J-2 

XJ = (3 . *X (KL, J) -4 . *X (KL, Jl) +X (KL, J2) ) *DX2 
YJ = (3.*Y(KL,J)-4.*Y(KL / J1)+Y(KL / J2))*DX2 
ZJ = (3 . *Z (KL, J) -4 . *Z (KL, Jl) +Z (KL, J2) ) *DX2 • 
END IF 

IF (K .EQ. LQQO) THEN 

CALL RECV (LMAX 1) 


END IF 

IF ( (L.NE.l) .AND. (L.EQ.LMAX) ) THEN 
IF (K .EQ. LQQO) THEN 

CALL RECV (DZ2 1) 


END IF 

XL = (X (LP, J) -X (LR, J) ) *DZ2 
YL = (Y (LP, J) -Y (LR, J) ) *DZ2 
ZL = (Z(LP,J)-Z(LR, J))*DZ2 
ELSE IF(L.EQ.l) THEN 
LI = KL+ND 
L2 = KL+2 *ND 

XL = (-3 . *X (KL, J) +4 . *X (LI , J) -X (L2 , J) ) *DZ2 
YL = (-3 . *Y (KL, J) +4 . *Y (LI , J) -Y (L2, J) ) *DZ2 
ZL = (-3 . *Z (KL, J) +4.*Z (L1,J) -Z (L2,J) ) *DZ2 
ELSE IF (L.EQ.LMAX) THEN 
LI = KL-ND 
L2 = KL-2*ND 

XL = (3 . *X (KL, J)-4.*X(L1,J)+X(L2,J) )*DZ2 
YL = ( 3 . * Y (KL, J)-4.*Y(L1,J) +Y (L2 , J) ) *DZ2 
ZL = (3.*Z (KL, J) -4.*Z (LI, J) +Z (L2, J) ) *DZ2 
END IF 

YY (K, 1 ) = (ZJ*YL-YJ*ZL) *RJ 
YY (K, 2) = (XJ*ZL-XL*ZJ) *RJ 
YY (K, 3) = ( Y J*XL-XJ*YL) *RJ 
IF (K .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 

c 

END IF 

YY (K, 4 ) = -OMEGA* (Z (KL, J) *YY (K, 2) -Y (KL, J) *YY (K, 3) ) 
10 CONTINUE 
KQQ=K2 

IF (K .EQ. KQQ) THEN 


CALL SEND (K 1) 
CALL SEND (KL 1) 
CALL SEND (LP 1) 
CALL SEND (LR 1) 
CALL SEND (RJ 1) 
CALL SEND (XJ 1) 
CALL SEND (YJ 1) 
CALL SEND (ZJ D 
CALL SEND (Jl 1) 
CALL SEND (J2 • 1) 
CALL SEND (XL 1) 
CALL SEND (YL 1) 



c 


CALL 

SEND 

(ZL 

1) 

CALL 

SEND 

(LI 

1) 

CALL 

SEND 

(L2 

1) 


ENDIF 


CALL SEND (YY (KZZ, *) KZZ=LQQO , MQQO , 1) 
END 


9 


9 



FILE 

3 {Processor 3} 



CALL RECV (K2 

1) 


CALL RECV (K1 

NQQ0= (K2- (Kl) +1-0) DIV 4+1 
LQQ0=Kl+0+2*NQQ0 
MQQ0=MIN(LQQ0+0+3*NQQ0-l,K2) 
DO 10 K=LQQ0,MQQ0 

1) 


IF (K .EQ. LQQO) THEN 
CALL RECV <ND 

1) 

r* 

CALL RECV (L 

1) 

U 

END IF 

KL = (L-l ) *ND+K 
LP = KL+ND 
LR = KL-ND 

IF (K .EQ. LQQO) THEN 
CALL RECV (J 

1) 


CALL RECV (Q ( * , * , J) KZZ=LQQ0 , MQQO , 1 ) ' 
END IF 

RJ = Q (KL, 6, J) 

IF (K .EQ. LQQO) THEN 
CALL RECV ( JMAX 

1) 

u 

END IF 

IF ( (J.NE.l) .AND. (J.EQ. JMAX) ) THEN 
IF (K .EQ. LQQO) THEN 
CALL RECV (DX2 

1) 

r* 

CALL RECV (JR 

1) 

U 

CALL RECV (JP 

CALL RECV (X ( * , * ) KZZ=LQQ0 , MQQO/ 1 ) 
END IF 

XJ = (X(KL, JP) -X(KL, JR) ) *DX2 
IF (K .EQ. LQQO) THEN 

1) 

O 

CALL RECV (Y(*, *)KZZ=LQQ0, MQQO, 1) 
END IF 

YJ = (Y (KL, JP) -Y (KL, JR) ) *DX2 
IF (K .EQ. LQQO) THEN 


L. 

CALL RECV (Z (*, *) KZZ=LQQ0 , MQQO , 1) 

END IF 

ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF (J.EQ. 1) THEN 
J1 = J+l 
J2 = J+2 

XJ = (-3 . *X (KL, J) +4 . *X (KL, J1 ) -X (KL, J2) ) 

*DX2 


YJ = (-3 . *Y (KL, J) +4 . *Y (KL, J1 ) -Y (KL, J2 ) ) 

*DX2 


ZJ = (-3.*Z (KL, J) +4.*Z (KL, Jl) -Z (KL, J2) ) 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 

*DX2 





then 


1) 


1) 


J 2 = n 2 

XJ = (3 
YJ = (3 
ZJ * < 3 

"(K EQ. LQQO) 
call recv_(lmax 

"endif md . ( l.eq.lmax)> THEtI 

f F l lK ®i wf THEN 

£li KECV_(DZ 2 

END IF T\ -X (LR/ J) ) *DZ2 

XL = ^ X ^p' j\ -Y (LR/ J) ) * DZ ^ 

(Y t?' 7 -z (lr»J))* dz2 

illl ! eQ.D ™ en 

else su:ssssSo «* 

• & 


YL = 
ZL - 

else 
Ll = 
L2 = 
XL = 
YL * 
ZL = 


;lix® 

- 3 • * Y ‘S' 5 '4 4 Li j> +z < L2 ' J) ' D 

= (3.*Z(KL,J) * * 


L2 
XL 
YL 

ZL = ( 3 
END IF 

YY (K,l) ' 
YY(K,2) = 
YY (K/3) - 
IF (K -EQ 

CALL RECV 


( Z J* YL"Y J* ZL) *HJ 

(XJ*ZL"XL*ZJ) E 

yj*xl-xj*xl) *RJ 
LQQO) then 
(OMEGA 


1) 


CALL RECV v-— 

*yy ( K 2 )-Z(KL,C 0 *«< K ' 3>> 

= -OMEGA* <Z (KL, J) *™ < K ' 

10 CONTINUE 

rf aa— V 9 


KQQ=K2 
!F (K .EQ. 

CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 
CALL SEND 


KQQ) 

(K 

(KL 

(LP 

(LR 

(RJ 

(XJ 

(XJ 

(ZJ 

(JI 

( J2 
(XL 
(XL 


then 


1) 

1) 

1) 

1) 

1) 

1) 

1) 

1) 

1) 

1) 

1) 

1) 


1 ) 

1 ) 

1 ) 


CALL SEND <ZL 
CALL SEND (LI 
CALL SEND (L2 
ENDIF 


CALL SEND (YY (KZZ, *) KZZ=LQQO , MQQO , 1) 
END 


9 


FILE 


4 


{Processor 4} 


CALL RECV (K2 
CALL RECV (K1 

NQQ0= (K2- (Kl) +1-0) DIV4+1 
LQQ0=Kl+0+3*NQQ0 
MQQ0=K2 

DO 10 K=LQQ0,MQQ0 
IF (K .EQ. LQQO) THEN 
CALL RECV (ND 
CALL RECV <L 

C 

END IF 

KL = (L-l) *ND+K 
LP = KL+ND 
LR = KL-ND 

IF (K .EQ. LQQO) THEN 

CALL RECV (J 1) 

c 

CALL RECV (Q ( * , * , J) KZZ=LQQ0 , MQQO , 1 ) 

END IF 

RJ = Q (KL, 6/ J) 

IF (K .EQ. LQQO) THEN 

CALL RECV (JMAX 1) 

C 

ENDIF 

IF ( (J.NE.l) .AND. (J.EQ. JMAX) ) THEN 
IF (K .EQ. LQQO) THEN 
CALL RECV (DX2 

CALL RECV (JR 

C 

CALL RECV (JP 1) 

CALL RECV(X(*, *) KZZ=LQQ0,MQQ0, 1) 

ENDIF 

XJ = (X(KL, JP) -X(KL, JR) ) *DX2 
IF (K .EQ. LQQO) THEN 

C 

CALL RECV (Y <*, *) KZZ=LQQ0, MQQO, 1) 

ENDIF 

YJ = (Y (KL, JP) -Y (KL, JR) ) *DX2 
IF (K .EQ. LQQO) THEN 

C 

CALL RECV (Z (*, *) KZZ=LQQ0,MQQ0, 1) 

ENDIF 

ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF (J.EQ. 1) THEN 
J1 = J+l 
J2 = J+2 

XJ = (-3.*X(KL, J)+4.*X(KL, J1)-X(KL, J2) ) *DX2 
YJ = (-3 . *Y (KL, J) +4 . *Y (KL, Jl) -Y (KL, J2) ) *DX2 
ZJ = (-3. *Z (KL, J) +4.*Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 


1 ) 

1 ) 


1 ) 

1 ) 


1 ) 

1) 



c 


c 


c 


J2 = J-2 

XJ = (3 . *X (KL, J) -4 . *X (KL, J1 ) +X (KL, J2 ) ) *DX2 
YJ = (3.*Y(KL, J)-4.*Y(KL / J1)+Y(KL,J2) ) *DX2 
ZJ = (3 . *Z (KL, J) -4 . *Z (KL, Jl) +Z (KL, J2) ) *DX2 
END IF 

IF (K .EQ. LQQO) THEN 

CALL RECV (LMAX 1) 


END IF 

IF ( (L.NE.l) .AND. (L.EQ.LMAX) ) THEN 
IF (K .EQ. LQQO) THEN 

CALL RECV (DZ2 1) 


END IF 

XL = (X (LP/ J) -X (LR/ J) ) *DZ2 
YL = (Y (LP, J) -Y (LR, J) ) *DZ2 
ZL = (Z (LP, J) -Z (LR, J) ) *DZ2 
ELSE IF(L.EQ.l) THEN 
LI = KL+ND 
L2 = KL+2*ND 

XL = (-3 . *X (KL, J) +4.*X(L1,J) -X(L2,J) ) *DZ2 
YL = (-3 . *Y (KL, J)+4.*Y(L1,J) -Y (L2, J) ) *DZ2 
ZL = (-3. *Z (KL, J) +4 . *Z (LI, J) -Z (L2, J) ) *DZ2 
ELSE IF (L.EQ.LMAX) THEN 
LI = KL-ND 
L2 = KL-2*ND 

XL = (3 . *X (KL, J) -4 . *X (LI, J) +X (L2, J) ) *DZ2 
YL = (3 . *Y (KL, J) -4 . *Y (LI, J) +Y (L2, J) ) *DZ2 
ZL = (3.*Z(KL, J)-4.*Z(L1, J)+Z(L2, J))*DZ2 
END IF 

YY (K, 1) = (ZJ*YL-YJ*ZL) *RJ 
YY (K, 2) = (XJ*ZL-XL*Z J) *RJ 
YY (K, 3) = ( Y J*XL-XJ* YL) *RJ 
IF (K .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 


END IF 

YY (K, 4 ) = -OMEGA* (Z (KL, J) *YY(K, 2) -Y(KL, J) *YY(K,3) ) 
10 CONTINUE 
KQQ=K2 

IF (K .EQ. KQQ) THEN 


CALL 

SEND 

(K 

1) 

CALL 

SEND 

(KL 

1) 

CALL 

SEND 

(LP 

1) 

CALL 

SEND 

(LR 

1) 

CALL 

SEND 

(RJ 

1) 

CALL 

SEND 

(XJ 

1) 

CALL 

SEND 

(YJ 

1) 

CALL 

SEND 

(ZJ 

1) 

CALL 

SEND 

(Jl 

1) 

CALL 

SEND 

(J2 

1) 

CALL 

SEND 

(XL 

1) 

CALL 

SEND 

(YL 

1) 



1 ) 

1 ) 

1 ) 


CALL SEND (ZL 
CALL SEND (LI 
CALL SEND (L2 
END IF 


CALL SEND (YY (KZZ, * ) KZZ=LQQO , MQQO , 1) 
END 


9 



APPENDIX D. SUBROUTINE 'ZZM' 


a) . Sequential Code 


K = M 

DX2 = . 5/DX1 
DY2 = . 5/DY1 
JP = J+l 
JR = J-l 
DO 10 L = LI , L2 
KL = (L-1)*ND+K 
KP = KL+1 
KR = KL-1 
RJ = Q (KL, 6, J) 

IF ( (K.NE.l) .AND. (K.NE.KMAX) ) THEN 

XK = (X (KP, J) -X (KR, J) ) *DY2 

YK = (Y(KP, J) -Y(KR, J) ) *DY2 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 

ELSE IF (K . EQ . 1) THEN 

K1 = KL+1 

K2 = KL+2 

XK = (~3.*X(KL, J)+4.*X(K1, J)-X(K2, J)) *DY2 
YK = (-3.*Y(KL,J)+4.*Y(K1,J)-Y(K2,J))*DY2 
ZK = (-3.*Z (KL, J)+4.*Z (Kl, J) -Z (K2, J) ) *DY2 
ELSE IF (K.EQ.KMAX) THEN 
Kl = KL-1 
K2 = KL-2 

XK = (3. *X (KL, J) -4. *X (Kl, J) +X(K2, J) ) *DY2 
YK = (3.*Y(KL,J)-4.*Y(K1,J)+Y(K2, J))*DY2 
ZK = (3.*Z(KL,J)-4.*Z(K1,J)+Z(K2,J))*DY2 
END IF 

IF ( (J.NE.l) .AND. (J.NE. JMAX) ) THEN 
XJ = (X (KL, JP) -X (KL, JR) ) *DX2 
YJ = (Y(KL, JP) -Y(KL, JR) ) *DX2 
ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF(J.EQ.l) THEN 
J1 = J+l 
J2 = J+2 

XJ = (-3 . *X (KL, J) +4 . *X (KL, Jl) -X (KL, J2) ) *DX2 
YJ = (-3 . *Y (KL, J) +4 . *Y (KL, Jl) -Y (KL, J2) ) *DX2 
ZJ = (-3.*Z (KL, J) +4.*Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 
J2 = J-2 

XJ = (3 . *X (KL, J) -4 . *X (KL, Jl) +X (KL, J2) ) *DX2 
YJ = (3.*Y (KL, J) -4.*Y(KL, Jl) +Y (KL, J2) ) *DX2 
ZJ = (3 . *Z (KL, J) -4 . *Z (KL, Jl) +Z (KL, J2) ) *DX2 
END IF 

ZZ(L, 1) = ( YJ* ZK-Z J* YK) *RJ 
ZZ (L, 2) = (XK*ZJ-XJ*ZK) *RJ 
ZZ (L, 3) = (XJ*YK- YJ*XK) *RJ 

ZZ (L, 4) = -OMEGA* (Z (KL, J) *ZZ (L, 2) -Y (KL, J) *ZZ (L, 3) ) 



10 CONTINUE 
RETURN 
END 



APPENDIX D. (continue) 
b) . Parallel Code 


FILE 1 {Processor 1} 


CALL 

SEND 

(L2 

2) 

CALL 

SEND 

(LI 

2) 

CALL 

SEND 

(L2 

3) 

CALL 

SEND 

(LI 

3) 

CALL 

SEND 

(L2 

4) 

CALL 

SEND 

(LI 

4) 

CALL 

SEND 

(J 

2) 

CALL 

SEND 

(J 

3) 

CALL 

SEND 

(J 

4) 

K = M 



CALL 

SEND 

(K 

2) 

CALL 

SEND 

(K 

3) 

CALL 

SEND 

(K 

4) 

CALL 

SEND 

(ND 

2) 

CALL 

SEND 

(ND 

3) 

CALL 

SEND 

(ND 

4) 

CALL 

SEND 

(KMAX 

2) 

CALL 

SEND 

(KMAX 

3) 

CALL 

SEND 

(KMAX 

4) 

CALL 

SEND 

( JMAX 

2) 

CALL 

SEND 

( JMAX 

3) 

CALL 

SEND 

(JMAX 

4) 

CALL 

SEND 

(OMEGA 

2) 

CALL 

SEND 

(OMEGA 

3) 

CALL 

SEND 

(OMEGA 

4) 

DX2 = 

= . 5/DX1 


CALL 

SEND 

(DX2 

2) 

CALL 

SEND 

(DX2 

3) 

CALL 

SEND 

(DX2 

4) 

DY2 = 

= . 5/DY1 


CALL 

SEND 

(DY2 

2) 

CALL 

SEND 

(DY2 

3) 

CALL 

SEND 

(DY2 

4) 

JP = 

J+l 



CALL 

SEND 

(JP 

2) 

CALL 

SEND 

(JP 

3) 

CALL 

SEND 

(JP 

4) 

JR = 

J-l 



CALL 

SEND 

(JR 

2) 

CALL 

SEND 

(JR 

3) 

CALL 

SEND 

(JR 

4) 

NQQ0= 

= (L2- 

(LI) +1-0) DIV 4+1 



LQQ0=L1 

MQQ0=MIN(LQQ0+0+l*NQQ0-l,L2) 

LZZ=Ll+0+NQQ0 

MZZ=MIN(LZZ+0+2*NQQ0-l,L2) 



CALL SEND (Q (*, *, J) KZZ=LZZ, MZZ, 2) 

LZZ=Ll+0+2*NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l, L2) 

CALL SEND(Q(*,* / J)KZZ=LZZ,MZZ / 3) 

LZZ=Ll+0+3*NQQ0 

MZZ=MIN(LZZ+0+4*NQQ0-l,L2) 

CALL SEND (Q (*, *, J) KZZ=LZZ,MZZ, 4) 

LZZ=Ll+0+NQQ0 

MZZ=MIN (LZZ+0+2*NQQ0-l, L2) 

CALL SEND (X (*, *) KZZ=LZZ, MZZ, 2) 

LZZ=Ll+0+2*NQQ0 

MZZ=MIN(LZZ+0+3*NQQ0-l,L2) 

CALL SEND (X ( * , * ) KZZ=LZZ, MZZ, 3) 

LZZ=Ll+0+3*NQQ0 

MZZ=MIN (LZZ+0 + 4 *NQQ0-1 , L2 ) 

CALL SEND (X ( * , * ) KZZ=LZZ, MZZ, 4 ) 

LZZ=Ll+0+NQQ0 

MZZ=MIN (LZZ+0+2*NQQ0-l, L2) 

CALL SEND (Y (*, *) KZZ=LZZ, MZZ, 2) 

LZZ=Ll+0+2*NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l / L2) 

CALL SEND (Y (*, *) KZZ=LZZ,MZZ / 3) 

LZZ=Ll+0+3*NQQ0 

MZZ=MIN (LZZ+0 + 4 *NQQ0-1, L2) 

CALL SEND (Y (* , *) KZZ=LZZ, MZZ, 4) 

LZZ=Ll+0+NQQ0 

MZZ=MIN (LZZ+0+2*NQQ0-l, L2) 

CALL SEND (Z (*, *) KZZ=LZZ,MZZ, 2) 

LZZ=Ll+0+2*NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l , L2) 

CALL SEND (Z (*, *) KZZ=LZZ,MZZ, 3) 

LZZ=Ll+0+3*NQQ0 

MZZ=MIN (LZZ+0+4 *NQQ0-1 / L2) 

CALL SEND (Z (*, * ) KZZ=LZZ, MZZ, 4) 

DO 10 L=LQQ0, MQQO 
KL = (L-1)*ND+K 
KP = KL+1 
KR = KL-1 
RJ = Q (KL, 6, J) 

IF ( (K.NE.l) .AND. (K.NE.KMAX) ) THEN 

XK = (X (KP, J) -X (KR, J) ) *DY2 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 

ELSE IF (K.EQ.l) THEN 

K1 = KL+1 

K2 = KL+2 

XK = (-3.*X(KL, J) +4.*X(K1, J) -X(K2/ J) ) *DY2 
YK = (-3 . *Y (KL, J) +4 . *Y (Kl, J) -Y (K2, J) ) *DY2 
ZK = (-3 . *Z (KL, J) +4.*Z (Kl, J) -Z (K2,J) ) *DY2 
ELSE IF (K.EQ.KMAX) THEN 
Kl = KL-1 
K2 = KL-2 

XK = (3.*X(KL,J)-4.*X(K1,J)+X(K2,J))*DY2 



YK = (3 . *Y (KL, J) -4 . *Y (Kl, J) +Y (K2, J) ) *DY2 
ZK = (3 . *Z (KL, J) -4 . *Z (Kl, J) +Z (K2, J) ) *DY2 
END IF 

IF ( (J.NE.l) .AND. (J.NE. JMAX) ) THEN 
XJ = (X (KL, JP) -X (KL, JR) ) *DX2 
YJ = (Y (KL, JP) -Y (KL, JR) ) *DX2 
ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF(J.EQ.l) THEN 
J1 = J+l 

J2 = J+2 

XJ = (-3.*X(KL, J)+4.*X(KL, Jl) -X(KL, J2) ) *DX2 
YJ = (-3 . *Y (KL, J) +4 . *Y (KL, J1 ) -Y (KL, J2) ) *DX2 
ZJ = (-3. *Z (KL, J) +4 . *Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 

J2 = J-2 

XJ = (3 . *X (KL, J) -4 . *X (KL, Jl) +X (KL, J2) ) *DX2 

YJ = (3 . *Y (KL, J) -4 . *Y (KL, Jl) +Y (KL, J2) ) *DX2 

ZJ = (3.*Z (KL, J) -4 .*Z (KL, Jl) +Z (KL, J2) ) *DX2 
END IF 

ZZ (L, 1 ) = (YJ*ZK-ZJ*YK) *RJ 
ZZ (L, 2 ) = (XK*ZJ-XJ*ZK) *RJ 
ZZ (L, 3) = (XJ*YK-YJ*XK) *RJ 

ZZ (L, 4) = -OMEGA* (Z (KL,J) *ZZ (L, 2) -Y (KL, J) *ZZ (L, 3) ) 


10 CONTINUE 

C 

CALL RECV (L 1) 

CALL RECV (KL 1) 

CALL RECV (KP 1) 

CALL RECV (KR 1) 

CALL RECV (RJ 1) 

CALL RECV (XK 1) 

CALL RECV (YK 1) 

CALL RECV (ZK 1) 

CALL RECV (Kl 1) 

CALL RECV (K2 1) 

CALL RECV (XJ 1) 

CALL RECV (YJ 1) 

CALL RECV (ZJ 1) 

CALL RECV (Jl 1) 

CALL RECV (J2 1) 

KQQ=L2 

LZZ=Ll+0+NQQ0 


MZZ=MIN (LZZ+0+2*NQQ0-l,L2) 

CALL RECV (ZZ (KZZ , * ) KZZ=LZZ, MZZ , 2 ) 

LZZ=Ll + 0+2 *NQQ0 

MZZ=MIN (LZZ+0+3*NQQ0-l, L2) 

CALL RECV (ZZ (KZZ, *) KZZ=LZZ,MZZ, 3) 

LZZ=Ll+0+3*NQQ0 

MZZ=MIN (LZZ+0 + 4 *NQQ0-1 , L2 ) 

CALL RECV (ZZ (KZZ, *) KZZ=LZZ,MZZ, 4) 
END 


9 



FILE 


2 


{Processor 2} 


C 


C 


C 


C 


C 


C 


CALL RECV (L2 1) 

CALL RECV (LI 1) 

NQQ0= (L2- (LI) +1-0) DIV4+1 
LQQ0=Ll+0+NQQ0 
MQQ0=MIN (LQQ0+0+2*NQQ0-l,L2) 

DO 10 L=LQQ0 ; MQQO 
IF (L .EQ. LQQO) THEN 

CALL RECV (K 1) 

CALL RECV (ND 1) 


END IF 

KL = (L-l) *ND+K 
KP = KL+1 
KR = KL-1 

IF (L .EQ. LQQO) THEN 

CALL RECV (J 1) 


CALL RECV ( Q ( * , * # J) KZZ=LQQ0 , MQQO , 1 ) 

ENDIF 

RJ = Q (KL, 6, J) 

IF (L .EQ. LQQO) THEN 

CALL RECV (KMAX 1) 


ENDIF 

IF ( (K.NE.l) .AND. (K.NE.KMAX) ) THEN 
IF (L .EQ. LQQO) THEN 

CALL RECV (DY2 1) 


CALL RECV (X (*/ *) KZZ=LQQ0 , MQQO , 1) 
ENDIF 

XK = (X (KP, J) -X (KR, J) ) *DY2 
IF (L .EQ. LQQO) THEN 


CALL RECV ( Y ( * , * ) KZZ=LQQ0 , MQQO , 1 ) 
ENDIF 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 
IF (L .EQ. LQQO) THEN 


CALL RECV (Z (*, *)KZZ=LQQ0, MQQO, 1) 

ENDIF 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 
ELSE IF (K.EQ.l) THEN 
K1 = KL+1 
K2 = KL+2 

XK = (-3 . *X (KL, J) +4 . *X (Kl, J) -X (K2, J) ) *DY2 
YK = (-3.*Y(KL,J)+4.*Y(K1,J)-Y(K2,J))*DY2 
ZK = (~3.*Z(KL, J)+4.*Z(K1,J)-Z(K2, J))*DY2 
ELSE IF (K . EQ . KMAX) THEN 
Kl = KL-1 
K2 = KL-2 

XK = (3.*X(KL,J)-4.*X(K1,J)+X(K2,J))*DY2 



1 ) 


YK = (3. *Y (KL, J) -4.*Y (Kl, J) +Y (K2, J) ) *DY2 
ZK = (3 . *Z (KL, J) -4.*Z (Kl, J) +Z (K2, J) ) *DY2 
END IF 

IF (L .EQ. LQQO) THEN 
CALL RECV (JMAX 


END IF 

IF( (J.NE.l) .AND. (J.NE. JMAX) ) THEN 
IF (L .EQ. LQQO) THEN 

CALL RECV (DX2 1) 

CALL RECV (JR 1) 

CALL RECV (JP 1) 


END IF 

XJ = (X (KL, JP) -X (KL, JR) ) *DX2 
YJ = (Y(KL, JP) -Y(KL, JR) ) *DX2 
ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF(J.EQ.l) THEN 
J1 ■ J+l 
J2 = J+2 

XJ = ( - 3 . *X (KL, J) +4 . *X (KL, J1 ) -X (KL, J2 ) ) *DX2 
YJ = (-3. *Y (KL, J) +4 . *Y (KL, Jl) -Y(KL, J2) ) *DX2 
ZJ = (-3 . *Z (KL, J) +4 . *Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 
J2 = J-2 

XJ = (3 . *X (KL, J) -4 . *X (KL, Jl) +X (KL, J2) ) *DX2 
YJ = ( 3 . * Y (KL, J) -4 . * Y (KL, Jl ) +Y (KL, J2 ) ) *DX2 
ZJ = (3 . *Z (KL, J) -4 . *Z (KL, Jl) +Z (KL, J2) ) *DX2 
END IF 

ZZ (L, 1 ) = ( Y J*ZK-Z J* YK) *RJ 
ZZ (L, 2) = (XK*Z J-XJ*ZK) *RJ 
ZZ (L, 3) = (XJ* YK-YJ*XK) *RJ 
IF (L .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 

C 

END IF 

ZZ (L, 4 ) = -OMEGA* (Z (KL, J) *ZZ (L, 2) -Y (KL, J) *ZZ (L, 3) ) 
10 CONTINUE 
KQQ=L2 

IF (L .EQ. KQQ) THEN 


CALL SEND (L 1) 

CALL SEND (KL 1) 

CALL SEND (KP 1) 

CALL SEND (KR 1) 

CALL SEND (RJ 1) 

CALL SEND (XK 1) 

CALL SEND (YK 1) 

CALL SEND (ZK 1) 

CALL SEND (K1 1) 

CALL SEND (K2 1) 

CALL SEND (XJ 1) 

CALL SEND (YJ 1) 



1 ) 

1 ) 

1 ) 


CALL SEND (ZJ 
CALL SEND (J1 
CALL SEND (J2 
END IF 


CALL SEND (ZZ (KZZ, *) KZZ=LQQO , MQQO , 1) 
END 


9 


9 



FILE 

3 {Processor 3} 



CALL RECV (L2 

1) 


CALL RECV (LI 

NQQ0= (L2- (LI) +1-0) DIV4+1 

LQQ0=Ll+0+2*NQQ0 

MQQ0=MIN (LQQ0+0+3*NQQ0-l, L2) 

DO 10 L=LQQ0, MQQO 

IF (L .EQ. LQQO) THEN 

1) 


CALL RECV (K 

1) 


CALL RECV (ND 

1) 

c — 

END IF 

KL = (L-l ) *ND+K 
KP = KL+1 
KR = KL-1 

IF (L .EQ. LQQO) THEN 



CALL RECV (J 

1) 

C 

CALL RECV (Q ( * / * / J) KZZ=LQQ0 , MQQO , 1 ) 
END IF 

RJ = Q (KL, 6, J) 

IF (L .EQ. LQQO) THEN 



CALL RECV (KMAX 

1) 

c --- 

ENDIF 

IF ( (K.NE. 1) .AND. (K.NE.KMAX) ) THEN 
IF (L .EQ. LQQO) THEN 



CALL RECV (DY2 

1) 

c — 

CALL RECV (X (*, *)KZZ=LQQ0, MQQO, 1) 
ENDIF 

XK * (X (KP, J) -X (KR, J) ) *DY2 
IF (L .EQ. LQQO) THEN 


c 

CALL RECV (Y (*, *)KZZ=LQQ0, MQQO, 1) 
ENDIF 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 
IF (L .EQ. LQQO) THEN 


c — 

CALL RECV (Z (*, *)KZZ=LQQ0, MQQO, 1) 
ENDIF 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 
ELSE IF (K . EQ . 1 ) THEN 
K1 = KL+1 
K2 = KL+2 



XK = (-3 . *X (KL, J) +4 . *X (Kl, J) -X (K2, J) ) 

*DY2 


YK = (-3 . *Y (KL, J) +4 . *Y (Kl, J) -Y (K2, J) ) 

*DY2 


ZK = (-3. *Z (KL,J) +4.*Z (K1,J) -Z (K2,J) ) 
ELSE IF (K . EQ . KMAX) THEN 
Kl = KL-1 
K2 = KL-2 

*DY2 


XK = (3 . *X (KL, J) -4 . *X (Kl, J) +X (K2, J) ) *DY2 



1 ) 


YK = (3 . *Y (KL, J) -4 . *Y (Kl, J) +Y (K2, J) ) *DY2 
ZK = (3.*Z(KL,J)-4.*Z(K1,J)+Z(K2,J))*DY2 
ENDIF 

IF (L .EQ. LQQO) THEN 
CALL RECV ( JMAX 


ENDIF 

IF( (J.NE.l) .AND. (J.NE.JMAX) ) THEN . 

IF (L .EQ. LQQO) THEN 

CALL RECV (DX2 1) • 

CALL RECV (JR 1) 

CALL RECV (JP 1) 


ENDIF 

XJ = (X(KL, JP) -X(KL, JR) ) *DX2 
YJ = (Y (KL, JP) -Y (KL, JR) ) *DX2 
ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF(J.EQ.l) THEN 
J1 = J+l 

J2 = J+2 

XJ = (-3. *X(KL, J) +4 . *X (KL, Jl) -X(KL, J2) ) *DX2 
YJ = (-3. *Y (KL, J) + 4.*Y (KL, Jl) -Y (KL, J2) ) *DX2 
ZJ = (-3 . *Z (KL, J) +4 . *Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
Jl = J-l 

J2 = J-2 

XJ = (3 . *X (KL, J) -4 . *X (KL, Jl) +X (KL, J2) ) *DX2 

YJ = (3 . *Y (KL, J) -4 . *Y (KL, Jl) +Y (KL, J2) ) *DX2 

ZJ = (3 . *Z (KL, J) -4 . *Z (KL, Jl) +Z (KL, J2) ) *DX2 
ENDIF 

ZZ (L, 1) = (YJ*ZK-ZJ*YK) *RJ 
ZZ (L, 2 ) = (XK*ZJ-XJ*ZK) *RJ 
ZZ (L, 3) = (XJ* YK- Y J*XK) *RJ 
IF (L .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 

C 

ENDIF 

ZZ (L, 4) = -OMEGA* (Z (KL, J) *ZZ (L, 2) -Y (KL, J) *ZZ (L, 3) ) 
10 CONTINUE 


KQQ=L2 
IF (L .EQ. 
CALL SEND 

KQQ) THEN 
(L 

1) 

CALL 

SEND 

(KL 

1) 

CALL 

SEND 

(KP 

1) 

CALL 

SEND 

(KR 

1) 

CALL 

SEND 

(RJ 

1) 

CALL 

SEND 

(XK 

1) 

CALL 

SEND 

(YK 

1) 

CALL 

SEND 

(ZK 

1) 

CALL 

SEND 

(Kl 

1) 

CALL 

SEND 

(K2 

1) 

CALL 

SEND 

(XJ 

1) 

CALL 

SEND 

(YJ 

1) 



CALL SEND (ZJ 
CALL SEND (J1 
CALL SEND (J2 
END IF 


CALL SEND (ZZ (KZZ, *) KZZ=LQQO , MQQO , 1) 
END 


FILE 


4 


{Processor 4} 


1 ) 

1 ) 


CALL RECV (L2 
CALL RECV (LI 

NQQ0= (L2- (LI) +1-0) DIV 4+1 
LQQ0=Ll+0+3*NQQ0 
MQQ0=L2 

DO 10 L=LQQ0, MQQO 
IF (L .EQ. LQQO) THEN 
CALL RECV (K 
CALL RECV (ND 

C 

END IF 

KL = (L-l) *ND+K 
KP = KL+1 
KR = KL-1 

IF (L .EQ. LQQO) THEN 
CALL RECV (J 1) 

C 

CALL RECV (Q (*, * , J) KZZ=LQQ0 , MQQO, 1) 

ENDIF 

RJ = Q (KL, 6, J) 

IF (L .EQ. LQQO) THEN 

CALL RECV (KMAX 1) 

c 

ENDIF 

IF ( (K.NE. 1) .AND. (K.NE.KMAX) ) THEN . 

IF (L .EQ. LQQO) THEN 

CALL RECV (DY2 1) 

c 

CALL RECV (X (*, *) KZZ=LQQ0, MQQO, 1) 

ENDIF 

XK = (X (KP, J) -X (KR, J) ) *DY2 
IF (L .EQ. LQQO) THEN 

C 

CALL RECV ( Y ( * , * ) KZZ=LQQ0 , MQQO , 1) 

ENDIF 

YK = (Y (KP, J) -Y (KR, J) ) *DY2 
IF (L .EQ. LQQO) THEN 

C 

CALL RECV (Z ( * , * ) KZZ=LQQ0 , MQQO , 1 ) 

ENDIF 

ZK = (Z (KP, J) -Z (KR, J) ) *DY2 
ELSE IF (K.EQ. 1) THEN 
K1 = KL+1 
K2 = KL+2 

XK = (-3 . *X (KL, J) +4 . *X (Kl , J) -X (K2 , J) ) *DY2 
YK = (-3.*Y(KL,J)+4.*Y(K1,J)-Y(K2,J))*DY2 
ZK = (~3.*Z (KL, J) +4.*Z (Kl, J) -Z (K2, J) ) *DY2 
ELSE IF (K.EQ. KMAX) THEN 
Kl = KL-1 
K2 = KL-2 

XK = (3.*X(KL, J)-4.*X(K1, J)+X(K2, J))*DY2 


1 ) 

1 ) 



c 


c 


c 


YK = (3. *Y (KL, J) -4 . *Y <K1, J) +Y (K2, J) ) *DY2 
ZK = <3.*Z(KL,J)-4.*Z(K1,J)+Z(K2,J))*DY2 


ENDIF 

IF (L .EQ. LQQO) THEN 

CALL RECV (JMAX 1) 


ENDIF 

IF ( (J.NE.l) .AND. (J.NE. JMAX) ) THEN 
IF (L .EQ. LQQO) THEN 

CALL RECV (DX2 1) 

CALL RECV (JR 1) 

CALL RECV (JP 1) 


ENDIF 

XJ = (X(KL, JP) -X(KL, JR) ) *DX2 
YJ = (Y(KL, JP)-Y(KL, JR) ) *DX2 
ZJ = (Z (KL, JP) -Z (KL, JR) ) *DX2 
ELSE IF(J.EQ.l) THEN 
J1 = J+l 

J2 = J+2 

XJ = (-3 . *X (KL, J) +4 . *X (KL, J1 ) -X (KL, J2 ) ) *DX2 
YJ = ( -3 . * Y (KL, J) +4 . * Y (KL, J1 ) -Y (KL, J2 ) ) *DX2 
ZJ = (-3.*Z (KL, J) +4 .*Z (KL, Jl) -Z (KL, J2) ) *DX2 
ELSE IF (J.EQ. JMAX) THEN 
J1 = J-l 

J2 = J-2 

XJ = ( 3 . *X (KL, J) -4 . *X (KL, J1 ) +X (KL, J2 ) ) *DX2 

YJ = (3 . *Y (KL, J) -4 . *Y (KL, Jl) +Y (KL, J2) ) *DX2 
ZJ = (3 . *Z (KL, J) -4 . *Z (KL, Jl) +Z (KL, J2) ) *DX2 
ENDIF 

ZZ (L, 1 ) = ( YJ*ZK-Z J*YK) *RJ 
ZZ (L, 2) = (XK*Z J-XJ*ZK) *RJ 
ZZ (L, 3) = (XJ* YK-YJ*XK) *RJ 
IF (L .EQ. LQQO) THEN 

CALL RECV (OMEGA 1) 


ENDIF 

ZZ (L, 4 ) = -OMEGA* (Z (KL, J) *ZZ (L,2) -Y (KL, J) *ZZ (L, 3) ) 
10 CONTINUE 
KQQ=L2 

IF (L .EQ. KQQ) THEN 


CALL 

SEND 

(L 

1) 

CALL 

SEND 

(KL 

1) 

CALL 

SEND 

(KP 

1) 

CALL 

SEND 

(KR 

1) 

CALL 

SEND 

(RJ 

1) 

CALL 

SEND 

(XK 

1) 

CALL 

SEND 

(YK 

1) 

CALL 

SEND 

(ZK 

1) 

CALL 

SEND 

(K1 

1) 

CALL 

SEND 

(K2 

1) 

CALL 

SEND 

(XJ 

1) 

CALL 

SEND 

(YJ 

1) 



CALL 

SEND 

(ZJ 

1) 

CALL 

SEND 

(J1 

1) 

CALL 

SEND 

( J2 

1) 

ENDIF 



CALL 

SEND (ZZ (KZZ, * ) KZZ=LQQO , MQQO , 1) 



END 







9 


9 



